nextstrain / fauna

RethinkDB database to support real-time virus analysis
GNU Affero General Public License v3.0

Update and streamline mumps build #78

Closed: trvrb closed this 6 years ago

trvrb commented 6 years ago

@jameshadfield and @lmoncla ---

This PR does a few different things:

  1. Documents the new upload, download, etc. calls in builds/MUMPS.md.

  2. Deprecates the use of mumps_header_fix.tsv for GenBank uploads. This always felt clunky to me: lots of redundancy and hard to maintain. There are now three simple "fix" files: mumps_strain_name_fix.tsv, mumps_location_fix.tsv, and mumps_date_fix.tsv. Additionally, almost all the name fixes you had implemented in mumps_header_fix.tsv (like MuVi/Ontario.CAN/04.10[G] to MuVi/Ontario.CAN/04.10/G) were moved to a simple bit of regex in fix_name in mumps_upload.py. This should be much more maintainable. The vast majority of the time, a new virus in GenBank won't need any manual fixes to just work. For the time being, I've kept mumps_header_fix.tsv to fix Broad sequence headers using mumps_preprocess_fasta.py.

  3. Importantly, 20(!) full genomes that were added to GenBank in 2017 were completely missing from ViPR. These include the 11 genomes collected in 2017 from Japan and the three James noticed from Arkansas. Apparently, ViPR hasn't updated its mumps section for at least 11 months. I've switched to a direct accession download from GenBank, which was actually super easy. Accession downloads from GenBank can also be done through an API, so this could be a starting point for automatic database uploads. I'd highly recommend making sacra work with accession lists.
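
As a rough sketch of what an API-driven accession download could look like: the snippet below builds a request to NCBI's E-utilities `efetch` endpoint for a list of accessions. This is not the code in this PR. The function names are made up for illustration, and the accessions in the usage line are placeholders, not real mumps records.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities efetch endpoint
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accessions, rettype="gb"):
    """Build an efetch URL for a list of nucleotide accessions.

    rettype="gb" requests GenBank flat files; rettype="fasta" requests FASTA.
    (Hypothetical helper, not part of fauna.)
    """
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def download_records(accessions, rettype="gb"):
    """Fetch the records as text. Needs network access."""
    with urlopen(build_efetch_url(accessions, rettype)) as response:
        return response.read().decode("utf-8")


# Placeholder accessions, for illustration only
print(build_efetch_url(["PLACEHOLDER1", "PLACEHOLDER2"], rettype="fasta"))
```

For long accession lists, the request would probably need to be split into batches and rate-limited according to NCBI's usage guidelines.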


@lmoncla: A note about strain names. In mumps_header_fix.tsv you had things like:

Unless I'm confused about the meaning of Jeryl Lynn, these all represent the same single primary isolate from Jeryl Lynn in 1963. These should all be collapsed to a single "virus" in our table. A virus includes the primary case metadata. Different "sequences" can be made from this same primary isolate, each with different passage histories, etc. So I've collapsed all of these to Jeryl_Lynn.
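
To make the virus/sequence distinction concrete, here is a minimal sketch of the collapsing step. The raw strain names, accessions, passage values, and field names are all hypothetical; they are not the entries from mumps_header_fix.tsv or the actual fauna schema.

```python
from collections import defaultdict

# Hypothetical strain-name fixes: several raw names map to one virus.
STRAIN_NAME_FIX = {
    "Jeryl_Lynn_example_A": "Jeryl_Lynn",
    "Jeryl_Lynn_example_B": "Jeryl_Lynn",
}


def collapse_to_viruses(records, name_fix):
    """Group sequence records under a single virus per fixed strain name.

    Each virus keeps a list of its sequences, and each sequence keeps its
    own accession and passage history.
    """
    viruses = defaultdict(list)
    for record in records:
        # Fall back to the raw name when no fix is listed
        strain = name_fix.get(record["strain"], record["strain"])
        viruses[strain].append(
            {"accession": record["accession"], "passage": record["passage"]}
        )
    return dict(viruses)


# Made-up records for illustration
records = [
    {"strain": "Jeryl_Lynn_example_A", "accession": "ACC_1", "passage": "passage_x"},
    {"strain": "Jeryl_Lynn_example_B", "accession": "ACC_2", "passage": "passage_y"},
    {"strain": "Some_other_strain", "accession": "ACC_3", "passage": "unpassaged"},
]
viruses = collapse_to_viruses(records, STRAIN_NAME_FIX)
print(sorted(viruses))
```

Here both Jeryl Lynn records end up as two sequences under one `Jeryl_Lynn` virus, while the unrelated strain stays separate.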

Generally, we don't want any lab strains in the live site.


I'm waiting to merge until I can get the BCCDC genomes in and for this I need the FASTAs.

lmoncla commented 6 years ago

@trvrb this looks great. Thank you! You are correct about Jeryl Lynn. Some of the isolates may have different passage histories and may have been manufactured in different years in different countries, but they all originated in 1963 from Jeryl Lynn. So collapsing all of them into the same virus with different sequences makes sense to me.

jameshadfield commented 6 years ago

@trvrb is this PR trying to replicate the behavior described here https://github.com/nextstrain/fauna/blob/master/builds/MUMPS.md or here https://github.com/nextstrain/fauna/blob/master/MUMPS.md ?

The latter describes the method which has created the live site.

trvrb commented 6 years ago

@jameshadfield: I (obviously) didn't realize that fauna/MUMPS.md existed (only thought to look at builds/ alongside other viruses). Regardless, I think the current PR is definitely the way to go. Maintaining mumps_header_fix.tsv to pass in with --fasta_header_fix is going to be super annoying.
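
For comparison, the regex approach replaces per-strain entries in a fix file with one general rule. A minimal sketch, assuming the only pattern to handle is a trailing bracketed genotype like `[G]`; the real fix_name in mumps_upload.py may cover more cases, and `fix_name_sketch` is a hypothetical name:

```python
import re


def fix_name_sketch(name):
    """Rewrite a trailing bracketed genotype as a slash-separated field.

    Names without a trailing "[...]" are returned unchanged (apart from
    stripping surrounding whitespace).
    """
    return re.sub(r"\[([A-Za-z0-9]+)\]$", r"/\1", name.strip())


print(fix_name_sketch("MuVi/Ontario.CAN/04.10[G]"))
```

A new GenBank strain that follows the same naming pattern then needs no new row in any fix file.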

jameshadfield commented 6 years ago

@trvrb can you remove any of the following files that are no longer needed: mumps_citations.tsv, mumps_date_fix.tsv, mumps_header_fix.tsv, mumps_location_fix.tsv, mumps_strain_name_fix.tsv?

trvrb commented 6 years ago

Yes. It's just the mumps_citations.tsv that's no longer used.

jameshadfield commented 6 years ago

What about lines 6-110 and 236-238 of mumps_header_fix?

trvrb commented 6 years ago

Merging this now.