questions on docs for make-db --reference

gibsonlab / chronostrain

questions on docs for make-db --reference #33

Open nick-youngblut opened 1 year ago

nick-youngblut commented 1 year ago

According to the README docs:

The TSV file must contain at least the following columns: Accession, Genus, Species, Strain, ChromosomeLen, SeqPath, GFF. An easy way to do this is using the ncbi-genome-download tool and using our script (link here).

I have a few questions about this input. It would be great to have some clarification on the following:

What is the format for SeqPath and GFF? Is GFF a path to a GFF file?
- If so, it doesn't seem to follow the convention of SeqPath ("Path" in the name), so it's unclear.
Is SeqPath a path to the assembly fasta file (nucleotide)?
Can the fasta file be compressed (gzip or otherwise)?
Can the GFF be compressed (gzip or otherwise)?
What if the genome has no accession (e.g., a newly generated MAG)?
- Can one generate a placeholder accession?
- Is there a required format for the accession string?

Also, the actual link appears to be missing from "link here".

yk23 commented 1 year ago

Hi, sorry about this -- the software is still under development as we change some specifications for the paper we are submitting! (And I've personally had a major life event recently and will resume work on Aug 1.)

In the new version (previewable on the dataset-nt-jax branch), we provide a newer, more streamlined way to create the database, including the usage of MAGs. The changes will be merged into main sometime in early August.

yk23 commented 1 year ago

For an example (in the meantime), check out the dataset-nt-jax branch, in particular the klebsiella script 'examples/database/complete_recipes/klebsiella.sh' which will generate an index file as an example.

yk23 commented 1 year ago

Finally, to answer some questions one by one: SeqPath and GffPath are both absolute or relative file paths, to a FASTA assembly file (e.g. nucleotide contig records in standard FASTA format parsable by BioPython) and a GFF3 annotation file, respectively. The software should accept gzipped versions of both (if it doesn't, then that's a bug that has since been fixed on the experimental branch).
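As a quick sanity check before building the database, here is a minimal sketch (standard library only, not part of chronostrain) that reads a FASTA file whether or not it is gzipped and reports per-record lengths. The helper names `open_maybe_gzip` and `fasta_lengths` are hypothetical, and using the summed length as a ChromosomeLen value is an assumption rather than something the docs confirm.

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a text file for reading, transparently handling gzip.

    Compression is detected from the two-byte gzip magic number,
    not from the file extension.
    """
    path = Path(path)
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_lengths(path):
    """Return {record_id: sequence_length} for a (possibly gzipped) FASTA file."""
    lengths = {}
    current = None
    with open_maybe_gzip(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record ID is the first whitespace-delimited token.
                parts = line[1:].split()
                current = parts[0] if parts else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths
```

Calling `sum(fasta_lengths(path).values())` gives the total assembly length, which may be a reasonable value for the length column of a MAG with several contigs, though that interpretation is a guess.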

A placeholder accession ID suffices as long as it is unique (just to be safe, use a format completely different from NCBI's). As long as your SeqPath is valid, the database indexer will automatically create symbolic links (in the Unix sense) that the software will recognize.
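To illustrate, here is a minimal sketch of writing such an index TSV with a placeholder accession for a MAG. The column names are the ones quoted from the README at the top of this issue and may differ on the newer branch. The `LOCALMAG_` prefix, the helper names, and the example taxonomy, length, and file paths are all hypothetical.

```python
import csv
import io

# Columns as listed in the README quoted above (may change in later versions).
COLUMNS = ["Accession", "Genus", "Species", "Strain", "ChromosomeLen", "SeqPath", "GFF"]


def placeholder_accession(n, taken, prefix="LOCALMAG"):
    """Build a placeholder ID and record it in `taken`; reject duplicates.

    The prefix is an arbitrary choice, picked only so the ID does not
    look like an NCBI accession.
    """
    acc = f"{prefix}_{n:05d}"
    if acc in taken:
        raise ValueError(f"duplicate accession: {acc}")
    taken.add(acc)
    return acc


def write_index(rows, handle):
    """Write rows (dicts keyed by COLUMNS) as a tab-separated index file."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


taken = set()
rows = [
    {
        "Accession": placeholder_accession(1, taken),
        "Genus": "Klebsiella",
        "Species": "pneumoniae",
        "Strain": "mag1",                 # hypothetical strain label
        "ChromosomeLen": 5300000,         # hypothetical length
        "SeqPath": "genomes/mag1.fna.gz",  # hypothetical path
        "GFF": "genomes/mag1.gff.gz",      # hypothetical path
    }
]
buf = io.StringIO()
write_index(rows, buf)
print(buf.getvalue(), end="")
```

Passing an open file handle instead of the `io.StringIO` buffer writes the same content to disk.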

A new, as-yet-undocumented change in the upcoming version is that the GFF3 file is optional; the database will initialize (albeit with less meaningful metadata) even if it is not provided.