HCV reference data prep questions

sidneymbell commented 2 years ago

Hello there! I've been chatting a bit with @corneliusroemer about this, but figured it was easier to write it all down and ask the group. I'm quite stuck at this point and would love some pointers.

Goal

Enable lineage calling, QC inspection and mutations list for HCV.

Issue

HCV is made up of 7-8 (depending on who you ask) different "genotypes." These differ from one another by up to 30% at the sequence level (i.e. as little as ~70% identity). Within each genotype, there can also be many "subtypes," which differ from one another quite significantly as well (one review article said 20%, but that seems high to me).

Approach

Current plan is to make multiple nextclade datasets to split apart these tasks, since a list of mutations or QC metrics of HCV genotype X against a reference sequence from genotype Y is going to be very long and uninformative.

A - Run lineage calling only, using a pan-HCV tree with annotated genotypes. Leaving subtypes alone for now; reference data is too sparse to do this reliably.

B - Bin input sequences by genotype, then run QC and mutation calling against a genotype-specific reference.
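To make the hand-off between step A and step B concrete, here is a minimal sketch of the binning step. It assumes the step A results are available as a tab-separated table with `seqName` and `clade` columns (the column names Nextclade uses in its TSV output) and that the clade labels are the genotype names. The function names and the example data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def bin_by_clade(tsv_text):
    """Group sequence names by the clade (genotype) assigned in step A."""
    bins = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        bins[row["clade"]].append(row["seqName"])
    return dict(bins)


def split_fasta(fasta_text, bins):
    """Return one FASTA string per genotype, keyed by clade label."""
    records = {}
    name = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line.strip())
    per_genotype = {}
    for clade, names in bins.items():
        # Sequences missing from the FASTA are skipped rather than raising.
        per_genotype[clade] = "".join(
            f">{n}\n{''.join(records[n])}\n" for n in names if n in records
        )
    return per_genotype


# Made-up example data, not real HCV calls.
calls = "seqName\tclade\nseqA\t1\nseqB\t3\nseqC\t1\n"
fasta = ">seqA\nACGT\n>seqB\nACGA\n>seqC\nACGG\n"

bins = bin_by_clade(calls)
per_genotype = split_fasta(fasta, bins)
```

Each value in `per_genotype` could then be written to its own file and run against the matching genotype-specific dataset for step B.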

Progress

I've tracked down some decent reference sequences for each genotype, cleaned up all publicly available HCV data from NCBI, and made myself a nice little nextstrain build for pan-HCV.

Specific questions

1 - Does the two-step plan outlined above sound reasonable to you? If so, it would be super helpful to get some docs on which input reference data files are required to produce which kind of output.

2 - Relatedly, for Step A above, the docs say that the tree must be rooted on the same sequence used for mutation calling. This makes total sense as I'm assuming you're using the branch mutation annotations to compare to the per-sequence mutations for placement.
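To spell out my assumption about why the root matters, here is a toy sketch of mutation-based placement. This is not Nextclade's actual implementation; the tree layout, the `accumulate` and `place` helpers, and the mutation names are all hypothetical. The point it illustrates is that branch mutations are expressed relative to the root, while query mutations are expressed relative to the reference, so the two sets are only comparable if root and reference are the same sequence.

```python
def accumulate(node, inherited=frozenset()):
    """Yield (name, clade, mutations accumulated from the root) per node.

    Toy simplification: reversions and repeated hits at a site are ignored.
    """
    muts = inherited | frozenset(node["mutations"])
    yield node["name"], node["clade"], muts
    for child in node.get("children", []):
        yield from accumulate(child, muts)


def place(tree, query_mutations):
    """Pick the node whose accumulated mutations differ least from the query."""
    query = frozenset(query_mutations)
    return min(accumulate(tree), key=lambda item: len(item[2] ^ query))


# Hypothetical tree: the root carries no mutations because it is assumed
# to be identical to the reference sequence.
tree = {
    "name": "root", "clade": "unassigned", "mutations": [],
    "children": [
        {
            "name": "n1", "clade": "1", "mutations": ["A10G", "C25T"],
            "children": [
                {"name": "n1a", "clade": "1", "mutations": ["G40A"]},
            ],
        },
        {"name": "n3", "clade": "3", "mutations": ["T12C"]},
    ],
}

# Query mutations are called against the reference (here: the root).
name, clade, _ = place(tree, ["A10G", "C25T", "G99T"])
```

If the tree were rooted elsewhere, every accumulated set would be offset by the mutations between that root and the reference, and the distances would stop being meaningful.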

I'm not quite sure how to proceed here, though, as for HCV there isn't a great "outgroup" to force a root on per this thread with Ollie. See example tree below, with the assigned outgroup genotype hovered -- using treetime to assign clades to the full tree with dta based on a few annotated representatives per genotype does an OK job even with the clearly suboptimal root being forced, but I'm not sure I totally trust it.

Thoughts?

sidneymbell commented 2 years ago

Most salient question here is really about the mapping between desired output files and the required input files.

E.g., if I just want to generate lineage calls using the grafted tree, do I really need all of the input files? Or is a tree annotated with clade labels at all nodes + aa_muts + a reference genome sufficient? Currently this errors, but I'm curious to understand why.


```
Error:
   0: When `--input-dataset` is not specified, the following arguments are required:
      --input-gene-map
      --input-qc-config
      --input-pcr-primers
      --input-virus-properties
```

ivan-aksamentov commented 2 years ago

Hi Sidney @sidneymbell,

I cannot help with the sciency bits, so I'll leave it to our scientists, but regarding the

mapping between desired output files and the required input files

You'll definitely want a gene map (genome annotation). Without knowing genes there is no way to do translation and get any of the AA stuff, including AA mutations.

For the rest of the files, you can leave them blank or put some approximate dummy values for now:

PCR primers CSV can be empty (headers might be needed). In this case no primers will be analyzed.
Virus properties JSON: only "schemaVersion": "1.10.0" is required, everything else is specific to different viruses and is optional, so you can have a file with just { "schemaVersion": "1.10.0" } inside and I think it might do.
QC config: same thing, just { "schemaVersion": "1.2.0" }, everything else is optional, but also quite easy to tweak to your needs.
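A minimal sketch of writing those placeholder files, in case it saves some typing. The two `schemaVersion` values are the ones quoted above. The file names (`virus_properties.json`, `qc.json`, `primers.csv`) and the primer header columns are assumptions based on what the existing SARS-CoV-2 dataset appears to use, so check them against a downloaded dataset before relying on them. `write_placeholders` is a made-up helper name.

```python
import json
import tempfile
from pathlib import Path

# Assumed header row, modelled on the SARS-CoV-2 dataset's primers.csv.
PRIMER_HEADER = ["Country (Institute)", "Target", "Oligonucleotide", "Sequence"]


def write_placeholders(dataset_dir, primer_header):
    """Write minimal virus properties, QC config and primers files."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)

    # Only schemaVersion is set; everything else is left to defaults.
    (dataset_dir / "virus_properties.json").write_text(
        json.dumps({"schemaVersion": "1.10.0"}, indent=2) + "\n"
    )
    (dataset_dir / "qc.json").write_text(
        json.dumps({"schemaVersion": "1.2.0"}, indent=2) + "\n"
    )

    # Header row only, so no primers get analyzed.
    (dataset_dir / "primers.csv").write_text(",".join(primer_header) + "\n")

    return sorted(p.name for p in dataset_dir.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    skeleton = Path(tmp) / "hcv-skeleton"
    created = write_placeholders(skeleton, PRIMER_HEADER)
    qc = json.loads((skeleton / "qc.json").read_text())
```

The gene map, reference sequence and tree still have to be supplied separately; only the three "can be blank" files are covered here.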

It's probably more convenient to take a dataset directory for an existing virus (e.g. SC2 or MPX) and use it as a skeleton - keep the file names, swap/add/remove the contents of these files, and run with --input-dataset, such that Nextclade discovers all the files using the filename conventions.
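As a rough sketch of that skeleton workflow, the snippet below checks a dataset directory for the files Nextclade would look for and builds the corresponding command line. The list of file names is an assumption based on the v2 dataset layout and may not match your version, and the directory and FASTA names are placeholders. The command is only constructed, not executed.

```python
import tempfile
from pathlib import Path

# Assumed v2 dataset file names; verify against a downloaded dataset.
EXPECTED_FILES = (
    "reference.fasta",
    "genemap.gff",
    "tree.json",
    "qc.json",
    "primers.csv",
    "virus_properties.json",
)


def missing_files(dataset_dir):
    """List expected dataset files that are absent from the directory."""
    dataset_dir = Path(dataset_dir)
    return [n for n in EXPECTED_FILES if not (dataset_dir / n).is_file()]


def nextclade_command(dataset_dir, fasta, out_dir):
    """Build (but do not run) a `nextclade run` invocation for a dataset dir."""
    return [
        "nextclade", "run",
        "--input-dataset", str(dataset_dir),
        "--output-all", str(out_dir),
        str(fasta),
    ]


# Demonstration on a throwaway directory holding only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    skeleton = Path(tmp)
    for filename in ("reference.fasta", "tree.json"):
        (skeleton / filename).write_text("")
    missing = missing_files(skeleton)

cmd = nextclade_command("hcv-pan", "sequences.fasta", "output")
```

Passing `cmd` to `subprocess.run` would launch the analysis once `missing_files` returns an empty list.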

You can download a dataset with the dataset get command, or pick one from here: https://github.com/nextstrain/nextclade_data/tree/master/data/datasets

It's very rare that people use these flags, let alone create datasets, so the workflow is not super straightforward and absolutely undocumented. There are definitely some defects and annoyances, because we have not had the resources to put more work into it yet. It would be interesting to hear about your experience and improvement ideas.

ivan-aksamentov commented 2 years ago

Regarding

make multiple nextclade datasets

I cannot tell in terms of science, but in terms of engineering this seems like a correct approach. We do this for monkeypox.

Note that since Nextclade v2.0.0, datasets can have so-called "attributes".

We don't quite leverage the full power of this feature for the existing datasets, due to their legacy nature, but theoretically it is possible to build a whole family of datasets, with arbitrary attributes attached to them and even build hierarchies of related datasets. Then when you download a dataset with dataset get, you'd use the --attribute flag. Each attribute has a default. So when not specified a default or a combination of defaults is downloaded.

In fact, name, reference and tag of existing datasets are also attributes. Check out the attributes fields in the dataset server index file: https://data.clades.nextstrain.org/index_v2.json

Right now you need not worry about it, but I just mention this in case there is need for a large number of related datasets. i.e. there is no technical reason to limit this number or to constrain the science and exploration. All these datasets can actually be organized very neatly and ergonomically.

But for starters we can just say there are 8 datasets, one for each "genotype", and if you want to drill down more, then some of them also subdivide into "subtypes", which makes it N datasets. This is totally fine, albeit a huge amount of work to prepare and validate them of course.

sidneymbell commented 2 years ago

Thank you so much, Ivan! This is hugely helpful and should be enough to get me started.

I'll take some notes on where I get stuck during the process and share them. Aspirationally, also happy to send in a small docs PR once I'm done with the process :)

sidneymbell commented 1 year ago

Hey folks. I've made some significant progress and written up a summary in this PR over on the nextclade-data repo. I'll close this issue so we can continue the conversation over there. Thanks all for your help!

nextstrain / nextclade