nextstrain / mpox

Nextstrain build for mpox virus
https://nextstrain.org/mpox
MIT License

Ingest INRB data with permission #242

Closed j23414 closed 2 months ago

j23414 commented 2 months ago

Description of proposed changes

A request came in on Slack to add the new INRB data on mpox clade I from https://www.medrxiv.org/content/10.1101/2024.04.12.24305195v2 to our Nextstrain analysis. INRB is working on submitting the data to NCBI, so this is a temporary solution similar to what has been done previously.

After obtaining permission to do so, this PR temporarily adds the records here to be included in the curated dataset.

Please feel free to push further commits to this branch or suggest changes.

corneliusroemer commented 2 months ago

Thanks @j23414! How did you populate the metadata? From the FASTA headers? We might need to s/find/replace some of the fields to conform with what ingest expects them to be called - unless you've already done so manually!

Have you done a test run of ingest to see whether the output looks right? Would be good to do that and link to the results! I'll see whether I can do that now.

This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132 (will have to check once the workflow is done).

j23414 commented 2 months ago

How did you populate the metadata? From the FASTA headers?

Hi @corneliusroemer! I hacked a fix for the fasta file headers using the following perl script (add_ids.pl):

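#! /usr/bin/env perl
# Prefix each FASTA header with a temporary ID plus fixed authors, region and country fields.

use strict;
use warnings;

# Placeholder accessions TMP0000 .. TMP0099 (string range via Perl's magic increment).
my @TMPIDS = ("TMP0000" .. "TMP0099");

my $i = 0;
while (<>) {
  if (/^>(.*)/) {
    my $header = $1;
    # Stop rather than emit empty IDs if the input has more records than IDs.
    die "More than " . scalar(@TMPIDS) . " records; extend the TMP ID range\n" if $i >= @TMPIDS;
    print ">$TMPIDS[$i++]";
    print "|INRB";
    print "|Africa";
    print "|Democratic Republic of the Congo";
    print "|$header\n";
  } else {
    print;  # sequence lines pass through unchanged
  }
}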

Then ran

perl add_ids.pl ingest/submission01_mpox47_2024.fasta > fixedheaders.fasta
./ingest/bin/fasta-to-ndjson \
 --fasta fixedheaders.fasta \
 --fields genbank_accession authors region country strain host ocountry division collected \
 --exclude ocountry \
 > ingest/data/inrb.ndjson
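
To make the step above easier to follow, here is a minimal sketch of how the nine names passed to --fields line up with the pipe-delimited header produced by add_ids.pl. It assumes the header is split on "|" and that the original INRB header already carried five pipe-delimited values (strain, host, ocountry, division, collected). map_header_fields is a hypothetical helper for illustration, not part of the ingest scripts, and the header values in the example are placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not part of nextstrain/mpox): print name=value pairs
# for a pipe-delimited FASTA header, pairing values with field names in order.
# Usage: map_header_fields HEADER NAME [NAME...]
map_header_fields() {
  header=$1
  shift
  # Strip the leading ">" and let awk split the rest on "|".
  printf '%s\n' "${header#\>}" |
    awk -F'|' -v names="$*" '
      BEGIN { n = split(names, name, " ") }
      { for (i = 1; i <= n; i++) printf "%s=%s\n", name[i], $i }
    '
}

# Placeholder header in the shape add_ids.pl would emit (values are made up).
map_header_fields \
  ">TMP0000|INRB|Africa|Democratic Republic of the Congo|STRAIN|HOST|DRC|DIVISION|DATE" \
  genbank_accession authors region country strain host ocountry division collected
```

In this picture, country holds the full name injected by the script, while ocountry holds whatever the original header carried and is the column dropped by --exclude ocountry.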

I then kept re-running nextstrain build ingest, editing the field names as needed until it ran successfully.

This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132 (will have to check once the workflow is done).

Ohh, thanks for submitting the GitHub Action check! 🙌 We should be able to grep "TMP" from the final sequences.fasta and metadata.tsv files.
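
That check could be sketched as two small filters. This assumes the temporary IDs keep the TMP-plus-digits shape from add_ids.pl and that metadata.tsv is tab-separated. count_tmp_headers and count_tmp_rows are hypothetical names, and both read from stdin so that compressed outputs can be piped in after decompression.

```shell
#!/bin/sh
# Hypothetical helpers for spot-checking that the temporary records made it
# through ingest. Both read from stdin and print a count.

# Count FASTA headers whose ID starts with TMP followed by digits.
count_tmp_headers() {
  awk '/^>TMP[0-9]+/ { n++ } END { print n + 0 }'
}

# Count TSV rows in which any tab-separated column is exactly a TMP ID.
count_tmp_rows() {
  awk -F'\t' '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^TMP[0-9]+$/) { n++; break }
  } END { print n + 0 }'
}
```

For example, count_tmp_headers < sequences.fasta and count_tmp_rows < metadata.tsv should report the same number if every temporary record passed through both outputs.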

corneliusroemer commented 2 months ago

Great, thanks for filling me in on the details! There might be a typo in one of your commands: ocountry rather than country.

j23414 commented 2 months ago

ocountry

Thanks for pointing this out! This was on purpose (--exclude ocountry) ;) It's so I could create a new country column and avoid running

sed 's/DRC/Democratic Republic of the Congo/g'

While DRC shouldn't be in the nucleotides section of a fasta file, I've seen stranger things.
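
One way to sidestep that worry, sketched here as an alternative rather than what was done in this PR, is to restrict the substitution to header lines with a sed address. expand_drc_in_headers is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: expand "DRC" only on FASTA header lines (those starting
# with ">"), leaving sequence lines untouched. Reads stdin, writes stdout.
expand_drc_in_headers() {
  sed '/^>/ s/DRC/Democratic Republic of the Congo/g'
}
```

Used as expand_drc_in_headers < input.fasta > output.fasta, where both file names are placeholders.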

corneliusroemer commented 2 months ago

Test run seems to have worked!

wget data.nextstrain.org/files/workflows/mpox/branch/add-inrb-with-permission/metadata.tsv.gz
wget data.nextstrain.org/files/workflows/mpox/branch/add-inrb-with-permission/sequences.fasta.xz

I'll merge then as it simplifies including the sequences in our builds. If there are outliers/issues, we can always simply exclude the accessions post-ingest, in the phylogenetic/nextclade workflows.

joverlee521 commented 2 months ago

Nice work @j23414! I hope the old instructions were somewhat helpful? Please feel free to update them with any extra steps you had to take here.


Thanks for pointing this out! This was on purpose (--exclude ocountry) ;) It's so I could create a new country column and avoid running

sed 's/DRC/Democratic Republic of the Congo/g'

While DRC shouldn't be in the nucleotides section of a fasta file, I've seen stranger things.

FYI, since all sources go through the ingest pipeline, you could have added this to the geolocation-rules.tsv as

Africa/DRC/*/*    Africa/Democratic Republic of the Congo/*/*
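
Since the rules file is a TSV, the two columns need a real tab between them, which is easy to lose when copying from a rendered comment. A minimal sketch for appending the rule follows. add_geolocation_rule is a hypothetical helper, and the path to geolocation-rules.tsv is left as a parameter because the thread does not give it.

```shell
#!/bin/sh
# Hypothetical helper: append one rule to a geolocation rules file,
# writing a literal tab between the raw and annotated locations.
# Usage: add_geolocation_rule RAW ANNOTATED RULES_FILE
add_geolocation_rule() {
  printf '%s\t%s\n' "$1" "$2" >> "$3"
}
```

For example: add_geolocation_rule 'Africa/DRC/*/*' 'Africa/Democratic Republic of the Congo/*/*' path/to/geolocation-rules.tsv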