jmcbroome / auto-pango-designation

Repository for automated flagging of new lineages for pango designation.
Other
5 stars 0 forks source link

General Feedback [2023-02-19] #194

Open corneliusroemer opened 1 year ago

corneliusroemer commented 1 year ago

When the parent starts with misc this isn't actually usable - one needs to figure out what the real parent actually is. covSpectrum will return no EPI_ISLs due to misc not being a real Pango lineage but a way @AngieHinrichs invented to deal with Usher messiness of certain parts of the tree not following the Pango structure.

To be able to work with the proposed lineage, it'd be very helpful to have the set of mutations leading up to the defining nodes (reversions being cancelled out) so one can put them into covSpectrum with [30-of:...] etc, where one can loosen the definition to see what potential ancestors could be.

For proper analysis, a link to the open tree on Taxonium is not enough, one needs to be able to look at the full Usher tree. What's the best/quickest way to get to that? A few EPI_ISLs would be sufficient to load it with Usher. It'd be best if the Usher tree was already created and could be loaded from a link like (ideally without trash so it doesn't disappear after 1-2 days): https://next.nextstrain.org/fetch/genome.ucsc.edu/trash/ct/subtreeAuspice1_genome_2d8d4_1ea230.json?label=id:node_8309435

There should be one commit per lineage, otherwise it's not easy to cherry-pick/revert single lineages that we don't want to accept from the PR. The commits should have commit messages similar to the current format, something like:

Added lineage XBF.4 via autolin

List of defining nucleotide mutations: A1234T,...
List of defining AA mutations: S:Q613H, ...

Lineage notes additions should be placed below the parent sorted on dealiased lineage names using natural sorting, not alphabetic sorting - in line with current practice.

Lineages must be correctly aliased -> instead of XBB.1.9.2.1, XBB.1.9.2 needs to be given an alias and then it's EG.1 etc

Instead of getting EPI_ISLs from covspectrum via a query, EPI_ISLs should come from the Usher tree used for designations, as these are the sequences actually designated.

If there wasn't the issue of S:158 being homplasic in XAY, it would be better to start XAY.2.3 with S:158G rather than with the inner ORF1a:A131V, as the spike would be expected to be responsible for any potential growth (dis)advantage.

image

Lineages are on the small end of the spectrum, maybe overly focused on the growth calculation? (Don't really know how it's calculated). E.g. proposed XBF.3.1 seems unremarkable and small.

The new proposals are weird. Often, very small sublineages of growing lineages are proposed. When it would be better to wait how things settle - especially when there are no signs of intrinsic growth advantages due to interesting Spike mutations. Often proposals seem to be .1 of recently designated lineages - when we would usually just wait and see how things turn out with a bit of time.

Designation.csv needs to contain just the strain name, not EPI_ISL and date, e.g. not England/QEUH-3269DF48/2023|OX417919.1|2023-01-09,miscBA.5.2CJ.1.1 but England/QEUH-3269DF48/2023|OX417919.1,XBQ

PR did not remove redesignated strains, previously designated ones need to be removed from their old designation.

Due to the above issues I ended up manually designating those lineages proposed in #193 pulling in strain names from a manually built Usher tree.

Review comments for each lineage can be found here: https://docs.google.com/spreadsheets/d/1vEKbYtX7HDfFtm20t-bGJ6Qog-l_mPVHF4Q3ZIkRNk8/edit?usp=sharing

jmcbroome commented 1 year ago

Thank you again for the detailed feedback!

Responding point by point:

  1. I will talk to Angie about this- I had noticed the odd naming schema, but not prioritized investigating what was going on with them. The simplest solution is just to remove these from the build before inferring new lineages, at the risk of creating new designations that overlap with their former locations- alternatively, perhaps I can implement a blacklist of lineages to ignore.
  2. To clarify, you want the complete set of mutations defining the sublineage all the way back to the root? Right now I'm only reporting mutations separating it from the parent lineage. Making this change is easy, though.
  3. I will have to discuss potentially hosting these files with Angie.
  4. This should be relatively straightforward. There are other alternatives to make it more easily editable as well, of course- using issues, multiple pull requests. It's also relatively simple to just remove the offending lineages from the input branches files and commit the updated files before merging.
  5. As I mentioned in my other response, I hadn't examined the notes closely enough to notice this pattern- only examined them long enough to quickly automate the writing of a note- definitely an oversight.
  6. As I mentioned in my other response, I was using pango_aliasor to handle the aliasing without examining the implementation- I now understand that the aliasor can only compress/decompress a preset selection and cannot infer new third-level compression letters. I can try to implement this programmatically myself, but if I've just misunderstood what pango_aliasor can do, or if you have other code that you use to identify the next-available letter combinations, I would appreciate guidance/materials. Additionally, there are potential problems/conflicts in defining new third-level letter combinations, as there's dependencies between lineages on what gets accepted or not- if we define EG for lineage XBB.1.9.2 when proposing XBB.1.9.2.1, but XBB.1.9.2.1 is rejected, and a different lineage XBB.2.4.3 or whatever does have a sublineage accepted and needs compression, it should be getting the EG- but instead it may have gotten EH, as EG was used with the ultimately rejected lineage. Correctly handling this type of aliasing needs more consideration.
  7. I'm gettin EPI_ISLs from LAPIS because I don't currently have a hosting solution for these files to make them accessible, and they are too numerous to simply print to the PR body even in a dropdown. This is something I can revisit.
  8. I have implemented mutation-level weighting, but not applied it in these particular results, as I wanted to minimize assumptions around the epidemiological impact of different mutations, spike or not.
  9. These lineages are small largely because the recent release seems to have sufficiently covered the active diversity. The growth modeling is only used for sorting, not for lineage creation.
  10. I can increase the minimum time of circulation or the minimum size of a lineage as needed.
  11. This can be parsed out and fixed with relative ease.
  12. I'm not sure I understand this point. The PR is only aware of lineages in the build on the day it was generated- in the case of this one, ones that were incorporated as of the 15th- and therefore probably accepted a few days before that. I do not have any way of telling whether a given lineage has an issue that was opened recently that is associated with it, or was designated between the latest data incorporated into the build and the time of review.
AngieHinrichs commented 1 year ago

When the parent starts with misc this isn't actually usable - one needs to figure out what the real parent actually is. covSpectrum will return no EPI_ISLs due to misc not being a real Pango lineage but a way @AngieHinrichs invented to deal with Usher messiness of certain parts of the tree not following the Pango structure.

Yes, misc is a warning flag that sequences on the branch may be recombinants (or garbage/mixtures/coinfections/contamination) and branches are less likely than usual to imply shared descent. Human analysis and decision-making will definitely be required to determine the real lineage parents of the proposed new lineage and whether it's recombinant or garbage, but that's why these are proposals and not just auto-pushes.

As for how to make it more compatible with CoV-Spectrum: if it's possible to walk back from the misc label to find the nearest ancestor with a real Pango lineage label (e.g. for miscBA.5.2CJ.1 it would be BM.1.1.1), then that label could be passed to CoV-spectrum and it will hopefully be at least close to the intended set.

(For miscDeltaBA1Post11k it would not be the nearest ancestral label miscDeltaBA1Post4k because that's not a real lineage either, it would be BA.1.1... which is not ideal because some recombinants on that big branch are from BA.1, but whatever, we're well past the DeltaBA1 days now. :)

CoV-Spectrum can't always find the labels from the tree anyway because some of them are very new and can't be assigned to sequences yet, but it's easier to guess the parental lineage for a real lineage like "XBB.1.5.7" than a "miscBlahBlah."

AngieHinrichs commented 1 year ago

To be able to work with the proposed lineage, it'd be very helpful to have the set of mutations leading up to the defining nodes (reversions being cancelled out) so one can put them into covSpectrum with [30-of:...] etc

To clarify, you want the complete set of mutations defining the sublineage all the way back to the root? Right now I'm only reporting mutations separating it from the parent lineage. Making this change is easy, though.

Ah, if you include enough ancestral-lineage mutations then it should be possible to omit the lineage from the CoV-Spectrum query, and that would avoid the problem with misc labels and the problem with brand-new labels that are in the tree but not yet assignable to sequences by nextclade. With Omicron lineages, the whole set might be more like 65-of than 30-of which might be a little burdensome for CoV-Spectrum? But going back 25 or 30 mutations (after cancelling out reversions) should be sufficient to get the right set of sequences from the query.

I think the way the distinguishing mutations are reported in the table is just fine; listing all the mutations in the table would take up too much space.

AngieHinrichs commented 1 year ago

For proper analysis, a link to the open tree on Taxonium is not enough, one needs to be able to look at the full Usher tree. What's the best/quickest way to get to that? A few EPI_ISLs would be sufficient to load it with Usher.

My favorite way to look at proposed lineages, since I have access to the full tree, is to get a list of the full |-concatenated names and paste those into taxonium's multi-Names search input. Then in a few seconds I get a cloud of circles and the proposed node is usually obvious from that. I can share the full tree with registered GISAID users but since that's a more manual process it only happens every week or two and I understand why pre-prepared Nextstrain JSON would be preferable.

It'd be best if the Usher tree was already created and could be loaded from a link like (ideally without trash so it doesn't disappear after 1-2 days)

matUtils can also make Nextstrain JSON -- metadata would need to be included in the JSON in order to use the https://nextstrain.org/fetch/... interface. As far as hosting, Ryan Hisner has created a repo just for storing JSONs that he uses in pango-designation proposals, that might be one way. [I know, really I should provide storage in the web interface instead of using our temporary storage area.]

AngieHinrichs commented 1 year ago

Lineages must be correctly aliased -> instead of XBB.1.9.2.1, XBB.1.9.2 needs to be given an alias and then it's EG.1 etc

As I also said in #193, I think this goes against ease of cherry-picking and it would be better to make a separate little script to choose the next available alias and update the files after someone decides to accept the proposed lineage.

AngieHinrichs commented 1 year ago

Designation.csv needs to contain just the strain name, not EPI_ISL and date, e.g. not England/QEUH-3269DF48/2023|OX417919.1|2023-01-09,miscBA.5.2CJ.1.1 but England/QEUH-3269DF48/2023|OX417919.1,XBQ

Still not England/QEUH-3269DF48/2023|OX417919.1 but just England/QEUH-3269DF48/2023.

AngieHinrichs commented 1 year ago

PR did not remove redesignated strains, previously designated ones need to be removed from their old designation.

  1. I'm not sure I understand this point. The PR is only aware of lineages in the build on the day it was generated- in the case of this one, ones that were incorporated as of the 15th- and therefore probably accepted a few days before that. I do not have any way of telling whether a given lineage has an issue that was opened recently that is associated with it, or was designated between the latest data incorporated into the build and the time of review.

If I've matched up Jakob's 12 with the correct comment from Cornelius... I think Cornelius was referring to the sequence names (Nextstrain uses "strain" as the name column label) that are already in lineages.csv for the parental lineage.

For example, if lineages.csv already has

England/ABCD-123456/2023,BA.5.314

and then you propose a BA.5.314.1 that includes England/ABCD-123456/2023, it would be a problem for lineages.csv to have both the old and new assignments:

England/ABCD-123456/2023,BA.5.314
...
England/ABCD-123456/2023,BA.5.314.1

-- so when adding lines to lineages.csv, first you need to remove any lines that would cause duplicates like that.

jmcbroome commented 1 year ago

Oh, I think I see. Thank you for clarifying point 12 Angie! I will add that to my pipeline.

RE: the JSON- yes, I would be using matUtils/BTE to produce those, and would include metadata so the fetch link works. Github is far from a perfect file hosting solution, though, and it seems like the maintainers want JSONs that include GISAID samples- wouldn't those have to be hosted somewhere private where only registered GISAID users could access them, like you do with the GISAID tree?

RE: performing aliasing- yes, thank you for raising and clarifying this issue. I have been working a little with a fork of pango_aliasor to potentially generate these new levels. Your proposed solution of having a sort of after-script is interesting, but it depends on how exactly we end up implementing these changes. Under the pull request model, once the files are updated, it could just do a find and replace on the lineages.csv and notes files, perhaps. I think if we do this I might also reenable the 'auto.' prefixing because it makes it easier/more explicit which lineages I've generated, and then we can "lock them in" by running a cleaning/aliasing script that strips the prefix and adds any needed aliases.

AngieHinrichs commented 1 year ago

JSONs that include GISAID samples- wouldn't those have to be hosted somewhere private where only registered GISAID users could access them, like you do with the GISAID tree?

For pango-designation purposes, people have been posting limited-size subtrees that include EPI_ISL IDs in github issues. (UShER web interface max subtree size is 5000.) When a user registers with GISAID, they agree to not share the sequences with anyone not registered with GISAID. But the tree only contains partial information about the sequence -- due to imputation of ambiguous base calls, ignoring of indels, and the kinds of masking performed in the daily build process, the original FASTA sequences can't be reconstructed from the limited information in the tree. And when there are over 15 million sequences in GISAID, a few thousand is a smaller subset all the time.

I think if we do this I might also reenable the 'auto.' prefixing because it makes it easier/more explicit which lineages I've generated, and then we can "lock them in" by running a cleaning/aliasing script that strips the prefix and adds any needed aliases.

Yes, that will make it very clear whether cleanup/aliasing is required or not.