Open nick-youngblut opened 1 year ago
Hi, sorry about this -- the software is still under development as we change some specifications for the paper we are submitting! (And I've personally had a major life event recently and will resume work on Aug 1.)
In the new version (previewable on the dataset-nt-jax branch), we provide a newer/more streamlined way to create the database, including the usage of MAGs. Changes to be merged into main sometime early August.
For an example (in the meantime), check out the dataset-nt-jax branch, in particular the klebsiella script 'examples/database/complete_recipes/klebsiella.sh', which will generate an index file as an example.
Finally, to answer some questions one by one: SeqPath and GffPath are absolute or relative file paths to a FASTA assembly file (e.g. nucleotide contig records in standard FASTA format parsable by BioPython) and a GFF3 annotation file, respectively. The software should accept gzipped versions of both (if it doesn't, then that's a bug that has since been fixed on the experimental branch).
A placeholder accession ID suffices as long as it is unique (just to be safe, use a format completely different from NCBI's). As long as your SeqPath is valid, the database indexer will automatically create symbolic links (in the Unix sense) that the software will recognize.
A new, as yet undocumented change in the upcoming version is that the GFF3 is optional; the database will initialize (albeit with less meaningful metadata) even if it is not provided.
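To make the points above concrete, here is a rough sketch of how one might assemble such an index for local assemblies or MAGs. It is not the tool's actual indexer: the tab-separated layout, the "Accession" column name, the "LOCALMAG" prefix, and the helper names build_index_rows and write_index are all assumptions made for illustration. Only SeqPath and GffPath come from the discussion above, so check the branch's example recipe for the real format.

```python
import csv
from pathlib import Path

# Assumed file extensions; gzipped inputs are included because the reply says
# they should be accepted.
FASTA_SUFFIXES = (".fa", ".fna", ".fasta", ".fa.gz", ".fna.gz", ".fasta.gz")
GFF_SUFFIXES = (".gff", ".gff3", ".gff.gz", ".gff3.gz")


def build_index_rows(fasta_paths, gff_dir=None, prefix="LOCALMAG"):
    """Build one row per assembly with a unique placeholder accession.

    The prefix is hypothetical; it is only chosen to look nothing like an
    NCBI accession. GffPath is left empty when no annotation is found,
    since the GFF3 is described as optional.
    """
    rows = []
    for i, fasta in enumerate(sorted(Path(p) for p in fasta_paths), start=1):
        suffix = next((s for s in FASTA_SUFFIXES if fasta.name.endswith(s)), None)
        if suffix is None:
            raise ValueError(f"Unrecognized FASTA extension: {fasta}")
        stem = fasta.name[: -len(suffix)]

        gff_path = ""
        if gff_dir is not None:
            # Look for an annotation file sharing the assembly's base name.
            for ext in GFF_SUFFIXES:
                candidate = Path(gff_dir) / (stem + ext)
                if candidate.exists():
                    gff_path = str(candidate.resolve())
                    break

        rows.append(
            {
                "Accession": f"{prefix}_{i:06d}",  # assumed column name
                "SeqPath": str(fasta.resolve()),   # absolute path to the FASTA
                "GffPath": gff_path,               # may be empty
            }
        )
    return rows


def write_index(rows, out_path):
    """Write the rows as a tab-separated table (the layout is an assumption)."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["Accession", "SeqPath", "GffPath"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical directory layout, for illustration only.
    fastas = Path("assemblies").glob("*.fna.gz")
    write_index(build_index_rows(fastas, gff_dir="annotations"), "index.tsv")
```

Resolving paths to absolute form up front avoids ambiguity about which directory a relative path is interpreted against, and numbering the accessions sequentially guarantees uniqueness within a single index.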
According to the README docs:
I have a few questions about this input. It would be great to have some clarification on the following:
SeqPath and GFF? Is GFF a path to a GFF file? It isn't named like SeqPath ("Path" in the name), so it's unclear.
Is SeqPath a path to the assembly fasta file (nucleotide)?
Also, it appears that the link is missing for "link here".