nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License
213 stars 58 forks source link

Add coverage per CDS to output #1513

Open joverlee521 opened 1 month ago

joverlee521 commented 1 month ago

Motivated by https://github.com/nextstrain/pathogen-repo-guide/issues/50:

It seems like a common pattern for sequencing efforts to focus on specific genes instead of the full genome. It would be helpful for ingest to annotate each record's gene coverage to explore the data.

We are currently doing this in what feels like a very "hacky" way of using the Nextclade translations outputs to calculate gene coverage (e.g. dengue).

Would it possible to directly output gene/CDS coverage from Nextclade?

ivan-aksamentov commented 1 month ago

Can you help me to better understand what needs to be done? The formula or algorithm. Maybe there's some existing code?

Is this on nucleotides or amino acids?

What output format do you imagine?

j23414 commented 1 month ago

Sure thing! I'm currently calculating the percentage based on amino acids, where total gene coverage equals 1.0 and half coverage equals 0.5, with rounding to three significant figures. The code here:

https://github.com/nextstrain/dengue/blob/029702275f1e72eba07c9339185a2feb8eec942e/ingest/scripts/calculate-gene-converage-from-nextclade-translation.py#L39-L41

For an example output, can look at dengue metadata here: https://data.nextstrain.org/files/workflows/dengue/metadata_all.tsv.zst

Screenshot 2024-07-19 at 8 24 14 AM

It felt a little hacky to have a script back calculate from nextclade generated CDS files, so any help is appreciated to streamline the calculation.

ivan-aksamentov commented 1 month ago

@j23414

The word "coverage" is quite overloaded, so it's important to agree on what exactly are we going to do.

If I understand correctly the idea is to calculate, for each translated CDS, the percentage of amino acid positions which are:

  1. within alignment range (e.g. if a part of CDS falls to unsequenced 5' or 3' region - it is considered not covered)
  2. not X
  3. not deletion -
  4. not * stop codon

Is the (3) and (4) actually correct?

Additionally, should insertions be involved here?

The important part is also how you want the result to be presented. I think creating new column for each CDS might be a bit of an overkill for Nextclade in general (e.g. mpox has dozens/hundreds of CDSes which will end up in a very large tsv file). Can we agree on some single-column format? I propose something similar to what we have for aa mutations:

Gene1:0.123,Gene2:0.456,Gene3:0.789

What do you think?

Once I have precise specifications for the feature, I can add it in no time.

corneliusroemer commented 1 month ago

Thanks @ivan-aksamentov for fully specifying 😀 What you write is reasonable and is what I would have distilled from OP as well.

I agree with embedding in a single column with your proposed grammar.

I think we could round to 4 digits after comma to not lose any precision for CDS that are up to around 5k long (e.g. ORF1a in SC2). For dengue 3 makes sense (shorter CDSes) but to generalize, the extra figure is worth it. If we wanted to have fewer pointless characters, we could multiply the resulting number by 10000, saving us 0.

While we're at improving CDS metrics, one thing we have for nucs that might make sense to add to CDSes as well is "alignment range". I'd propose we also add this. Grammar:

S:10-148,N:None

The question here is what to put if a gene is completely missing. Maybe None, maybe NA, or null. I don't mind. But would be nice to have!

ivan-aksamentov commented 1 month ago

@corneliusroemer

Thanks @ivan-aksamentov for fully specifying

These were mostly my questions. I have no idea what I am specifying :)

You think that deletions and stop codons need to be considered not covered? There is nothing bad about stop codons it seems.

If there's an insertion do you count these positions as covered in addition to the normal ones (it might make coverage higher than 100%)

I did not plan to truncate the numbers. It will be read mostly programmatically, right?

ivan-aksamentov commented 1 month ago

For the nucs the current logic is as follows:

https://github.com/nextstrain/nextclade/blob/dc560aa705dff462936151c085857c4451f55524/packages/nextclade/src/run/nextclade_run_one.rs#L151-L153

It does not count either deletions or insertions.

corneliusroemer commented 1 month ago

Indeed we wouldn't count insertions, the way we do for nucs right now. That's partially as insertions can be artefacts and we want to avoid treating them as coverage. So we essentially are looking for aligned bases. You're right that stop codons should count.

ivan-aksamentov commented 1 month ago

I made a prototype, but this needs a thorough scientific review, because I have no idea what I am doing:

Prototype: https://github.com/nextstrain/nextclade/pull/1514 Web preview: https://nextclade-git-feat-aa-coverage-nextstrain.vercel.app/ Download CLI binaries in the "Artifacts" section of the latest build (once its' done): https://github.com/nextstrain/nextclade/actions/workflows/cli.yml?query=branch%3Afeat%2Faa-coverage