gibsonlab / chronostrain

MIT License

questions on docs for make-db --reference #33

Open nick-youngblut opened 1 year ago

nick-youngblut commented 1 year ago

According to the README docs:

The TSV file must contain at least the following columns: Accession, Genus, Species, Strain, ChromosomeLen, SeqPath, GFF. An easy way to do this is using the ncbi-genome-download tool and using our script (link here).

I have a few questions about this input. It would be great to have some clarification on the following:

Also, the link behind "link here" appears to be missing.

yk23 commented 1 year ago

Hi, sorry about this -- the software is still under development as we change some specifications for the paper we are submitting! (And I've personally had a major life event recently and will resume work on Aug 1.)

In the new version (previewable on the dataset-nt-jax branch), we provide a newer, more streamlined way to create the database, including the usage of MAGs. The changes will be merged into main sometime in early August.

yk23 commented 1 year ago

For an example (in the meantime), check out the dataset-nt-jax branch, in particular the klebsiella script 'examples/database/complete_recipes/klebsiella.sh' which will generate an index file as an example.

yk23 commented 1 year ago

Finally, to answer some questions one by one: SeqPath and GffPath are absolute or relative file paths to a FASTA assembly file (e.g. nucleotide contig records in standard FASTA format, parsable by BioPython) and a GFF3 annotation file, respectively. The software should accept gzipped versions of both (if it doesn't, that's a bug that has since been fixed on the experimental branch).
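As a quick sanity check before indexing, something like the following could confirm that a SeqPath file (plain or gzipped) reads as FASTA and report the length of each contig. This is only a rough sketch using the Python standard library, not ChronoStrain code; the function names `open_text` and `fasta_record_lengths` are made up for illustration, and the parser is deliberately simpler than BioPython's.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file in text mode.

    Compression is detected from the two gzip magic bytes rather than
    from the file extension.
    """
    path = Path(path)
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_record_lengths(path):
    """Return {record_id: sequence_length} for each record in a FASTA file."""
    lengths = {}
    current = None
    with open_text(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record ID is the first whitespace-delimited token of the header.
                parts = line[1:].split()
                current = parts[0] if parts else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths
```

Summing the returned values gives the total assembly length, which may or may not be what ChromosomeLen expects for a multi-contig assembly; that interpretation is an assumption, not something stated in this thread.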

A placeholder accession ID suffices as long as it is unique (just to be safe, use a format completely different from NCBI's). As long as your SeqPath is valid, the database indexer will automatically create symbolic links (in the Unix sense) that the software will recognize.
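For a hand-built index, a minimal sketch along these lines could write the TSV and catch duplicate accessions or bad paths early. It is not part of ChronoStrain: the column names are taken from the README text quoted at the top of this issue, while the helper `write_index` and the `LOCAL-0001` accession style are hypothetical, and the upcoming version may expect a different layout.

```python
import csv
from pathlib import Path

# Column names as listed in the README quoted in this issue.
COLUMNS = ["Accession", "Genus", "Species", "Strain", "ChromosomeLen", "SeqPath", "GFF"]


def write_index(rows, out_path):
    """Write genome metadata rows (a list of dicts) to a tab-separated index file.

    Raises ValueError on a duplicate Accession and FileNotFoundError when a
    SeqPath does not point to an existing file. Nothing is written unless
    every row passes both checks.
    """
    seen = set()
    for row in rows:
        accession = row["Accession"]
        if accession in seen:
            raise ValueError(f"duplicate accession: {accession}")
        seen.add(accession)
        if not Path(row["SeqPath"]).is_file():
            raise FileNotFoundError(row["SeqPath"])

    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)
```

A row might then look like `{"Accession": "LOCAL-0001", "Genus": "Klebsiella", "Species": "pneumoniae", "Strain": "isolate_A", "ChromosomeLen": 5300000, "SeqPath": "genomes/isolate_A.fasta.gz", "GFF": "genomes/isolate_A.gff.gz"}`, where every value is a made-up placeholder.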

A new, as-yet-undocumented change in the upcoming version is that the GFF3 file is optional; the database will initialize (albeit with less meaningful metadata) even if it is not provided.