ingest: Standardize steps for adding gene coverage to metadata

joverlee521 commented 1 month ago

It seems like a common pattern for sequencing efforts to focus on specific genes instead of the full genome. It would be helpful for ingest to annotate each record's gene coverage to explore the data.

This was previously done by @j23414 in dengue with https://github.com/nextstrain/dengue/pull/36.

We can add these as standardized steps to the ingest template, but one hiccup is that it requires running sequences through Nextclade. This is easy if a Nextclade dataset already exists, but not as straightforward if users need to create a Nextclade dataset from scratch.

The minimal Nextclade dataset files for annotating gene coverage are:

reference FASTA
genome annotation GFF file
pathogen.json

The main stumbling block is figuring out which reference to use (currently ingest does not require a reference) and creating the GFF file. It seems like we should have a comprehensive guide on how to get past these blockers in the template as well.
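To make "gene coverage" concrete, here is a minimal sketch of one way it could be computed per record. It assumes the sequence has already been aligned to the reference (for example, taken from an alignment that Nextclade produced) and that gene coordinates are 1-based and inclusive, as in a GFF file. The function name `gene_coverage`, the gene names, and the coordinates are all hypothetical. This is not necessarily how the dengue workflow computes it.

```python
from __future__ import annotations


def gene_coverage(aligned_seq: str, genes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Return the fraction of unambiguous bases (A/C/G/T) within each gene.

    Assumptions (not taken from any existing workflow):
    - aligned_seq is in reference coordinates (same length as the reference)
    - genes maps gene name -> (start, end), 1-based inclusive, as in GFF
    """
    called = set("ACGT")
    coverage = {}
    for name, (start, end) in genes.items():
        # Convert 1-based inclusive coordinates to a Python slice
        region = aligned_seq[start - 1:end].upper()
        gene_length = end - start + 1
        n_called = sum(base in called for base in region)
        coverage[name] = round(n_called / gene_length, 3)
    return coverage


# Toy example with made-up coordinates: N and "-" count as not covered
aligned = "ACGTNNNNAC--GTAC"
genes = {"E": (1, 8), "NS1": (9, 16)}
result = gene_coverage(aligned, genes)
print(result)
```

The resulting per-gene values could then be joined onto each metadata record as extra columns, which is what would let users filter for records that cover a gene of interest.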

genehack commented 1 month ago

A simple form of a flowchart for "figure out the reference" would be something like, "Is there a RefSeq entry? If so, use that. If not, do a literature search or consult an expert in the field." (I realize that's not great but I do think this is one of those areas where you kinda actually need to know something about what you're trying to do?)

As for constructing a GFF, there are tools that we could point to? Presumably the most common starting point is going to be a GenBank file; if somebody is trying to start with a completely unannotated FASTA as the reference sequence, again, they're probably going to need more specialized support than we want to provide?

joverlee521 commented 1 month ago

> As for constructing a GFF, there are tools that we could point to?

For sure! Richard has a script for generating the GFF from a GenBank accession, but I haven't personally tried it. https://github.com/nextstrain/nextclade_dataset_template/blob/sanitize_gff/generate_from_genbank.py
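For a sense of the target format, here is a rough sketch that writes a minimal GFF3 file from a hand-curated list of CDS coordinates. It is not the linked script and does not read GenBank files. The function name `to_gff3`, the `manual` source label, the accession `REF0001`, and the coordinates are hypothetical. It also assumes that a `Name` attribute on each CDS is enough to label genes, and that each CDS starts on a codon boundary (phase 0).

```python
from __future__ import annotations


def to_gff3(seqid: str, features: list[tuple[str, int, int, str]]) -> str:
    """Format (name, start, end, strand) tuples as GFF3 CDS lines.

    GFF3 has nine tab-separated columns:
    seqid, source, type, start, end, score, strand, phase, attributes.
    Coordinates are 1-based and inclusive.
    """
    lines = ["##gff-version 3"]
    for name, start, end, strand in features:
        columns = [
            seqid,
            "manual",        # source: hypothetical label for hand-curated features
            "CDS",
            str(start),
            str(end),
            ".",             # score: not applicable
            strand,
            "0",             # phase: assumes the CDS starts on a codon boundary
            f"Name={name}",  # assumption: Name is the attribute used for labels
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


# Made-up accession and coordinates, for illustration only
gff = to_gff3("REF0001", [("E", 100, 399, "+"), ("NS1", 400, 999, "+")])
print(gff)
```

The seqid in the first column has to match the sequence name in the reference FASTA header, so whichever reference is chosen determines that value.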

ivan-aksamentov commented 1 month ago

Just a quick clarification/precision: Nextclade technically does not require a GFF annotation - it can run with just a reference FASTA and a very minimal (almost empty) pathogen.json. Though, of course, without annotation it would not know anything about CDSes and amino acid things.

One idea for allowing faster bootstrapping of projects relying on Nextclade is to also not require annotations by default, where possible. This will end up with less useful analysis, but might encourage new learners and simplify their first steps. Will likely increase complexity of workflows though.

joverlee521 commented 1 month ago

Thanks for the clarification @ivan-aksamentov! I guess I didn't mean a minimum Nextclade dataset, but the minimum files needed to get the gene/CDS coverage, which does require a GFF annotation. I've updated the language above to be explicit.

nextstrain / pathogen-repo-guide

ingest: Standardize steps for adding gene coverage to metadata #50