soedinglab / spacedust

Discovery of conserved gene clusters in multiple genomes
GNU General Public License v3.0

Passing large numbers of files to createsetdb #5

Open SDmetagenomics opened 1 month ago

SDmetagenomics commented 1 month ago

I would like to run spacedust on a plasmid database. The database consists of ~60k individual files, each representing a separate plasmid "genome". However, when I pass the following command to spacedust:

$ spacedust createsetdb /individual_faa/*.faa SpacedustDB tmp --threads 18

bash: /shared/software/bin/spacedust: Argument list too long

I receive a bash error that the argument list is too long. I have tried a number of workarounds, such as passing an environment variable that contains all the file names, but to no avail.
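For context, "Argument list too long" is the message for the E2BIG error: the operating system refuses to start a program when the combined size of its arguments and environment exceeds a system limit (ARG_MAX). The glob is expanded by bash into one argument per file before spacedust is ever started, so the error is raised by the system, not by spacedust itself. The sketch below estimates whether a given argument list would fit. The function names are made up for illustration, and the 8-byte pointer size is an assumption for 64-bit systems; the real accounting varies by kernel.

```python
import os

POINTER_SIZE = 8  # assumption: 64-bit system, one pointer per string


def argv_bytes(args):
    """Approximate space for the arguments: bytes + NUL terminator + pointer each."""
    return sum(len(os.fsencode(a)) + 1 + POINTER_SIZE for a in args)


def environ_bytes(env=None):
    """Approximate space for the environment, stored as NAME=value strings."""
    env = os.environ if env is None else env
    # +2 covers the '=' separator and the NUL terminator
    return sum(
        len(os.fsencode(k)) + len(os.fsencode(v)) + 2 + POINTER_SIZE
        for k, v in env.items()
    )


def fits_in_arg_max(args, env=None, limit=None):
    """Return True if args plus environment are estimated to fit under the limit."""
    if limit is None:
        limit = os.sysconf("SC_ARG_MAX")  # system limit in bytes (Unix only)
    return argv_bytes(args) + environ_bytes(env) < limit
```

As a rough illustration, 60,000 paths of about 35 bytes each come to roughly 60,000 × (35 + 1 + 8) = 2,640,000 bytes under this estimate, which is above the 2 MiB limit that is common on Linux. Because the environment counts toward the same limit, moving the file names into an environment variable is not expected to help.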

It would be useful if, instead of a file glob (*), spacedust createsetdb could take a single input file listing the paths to each of the .faa files needed for database creation. Alternatively, I could create databases in batches and combine them, but I am not sure whether that is supported. Finally, if you have any other suggestions, I would be forever grateful.
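To make the two ideas above concrete, here is a minimal sketch of how the inputs could be prepared without ever expanding the glob on a command line. It only builds a path list file and splits a list into batches. Whether spacedust createsetdb accepts such a list file, or whether separately built databases can be combined, is not established in this thread, so both helper names and the list-file format (one absolute path per line) are assumptions.

```python
import os


def write_path_list(directory, list_file, suffix=".faa"):
    """Write one absolute path per line for each file in `directory` ending in `suffix`.

    Returns the number of paths written.
    """
    # os.scandir reads the directory directly, so no shell glob is involved
    with os.scandir(directory) as entries:
        paths = sorted(
            os.path.abspath(e.path)
            for e in entries
            if e.is_file() and e.name.endswith(suffix)
        )
    with open(list_file, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)


def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

For example, `write_path_list("/individual_faa", "faa_paths.txt")` would produce the kind of single input file requested here, and `batches(paths, 5000)` would give groups small enough to pass as ordinary arguments if a batch-and-combine route turns out to be supported.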

In terms of the total number of proteins, these plasmid "genomes" should be quite similar to the 9000 genomes you ran in the spacedust paper, since plasmids are much smaller in size. So I think it should be computationally manageable; the only trouble is getting all the files in :-)

My Environment