Closed by j23414 2 months ago
Thanks @j23414! How did you populate the metadata? From the FASTA headers? We might need to find/replace some of the fields to conform with what ingest expects them to be called, unless you've already done so manually!
Have you done a test run of ingest to see whether the output looks right? Would be good to do that and link to the results! I'll see whether I can do that now.
This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132 Will have to check once workflow is done.
> How did you populate the metadata? From the FASTA headers?
Hi @corneliusroemer! I hacked a fix for the fasta file headers using the following perl script (add_ids.pl):
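#! /usr/bin/env perl
# add_ids.pl: prefix each FASTA header with a temporary ID and fixed metadata fields.
use strict;
use warnings;

# Pre-generate placeholder accessions TMP0000 .. TMP0099 (string range).
my @TMPIDS=();
for my $i ("TMP0000" .. "TMP0099") {
    push @TMPIDS, $i;
}

my $i=0;
while(<>){
    if(/>(.*)/){
        # Header line: emit accession|authors|region|country|original header.
        my $header=$1;
        print ">$TMPIDS[$i++]";
        print "|INRB";
        print "|Africa";
        print "|Democratic Republic of the Congo";
        print "|$header\n";
    }else{
        # Sequence line: pass through unchanged.
        print;
    }
}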
Then ran
perl add_ids.pl ingest/submission01_mpox47_2024.fasta > fixedheaders.fasta
./ingest/bin/fasta-to-ndjson \
--fasta fixedheaders.fasta \
--fields genbank_accession authors region country strain host ocountry division collected \
--exclude ocountry \
> ingest/data/inrb.ndjson
And then kept checking nextstrain build ingest runs, editing the field names as needed to get it to run successfully.
> This might work: https://github.com/nextstrain/mpox/actions/runs/8883875132 Will have to check once workflow is done.
Ohh, thanks for submitting the GitHub Action check! 🙌 Should be able to grep "TMP" from the final sequences.fasta and metadata.tsv files.
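A minimal sketch of that check, assuming the record headers in sequences.fasta start with the accession and that the accession is the first column of metadata.tsv (neither is confirmed in this thread). The function names are hypothetical helpers, not part of the ingest workflow.

```shell
# Count FASTA records whose header starts with a TMP placeholder ID (reads stdin).
count_tmp_fasta() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the caller alive.
  grep -c '^>TMP[0-9]\{4\}' || true
}

# Count metadata rows (header row skipped) whose first column is a TMP placeholder ID.
# Assumption: the accession lives in column 1 of the TSV.
count_tmp_metadata() {
  awk -F '\t' 'NR > 1 && $1 ~ /^TMP[0-9][0-9][0-9][0-9]$/ { n++ } END { print n + 0 }'
}
```

Usage would be `count_tmp_fasta < sequences.fasta` and `count_tmp_metadata < metadata.tsv`. Given the TMP0000 to TMP0046 range in the PR description, both should print 47 if every record made it through.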
Great, thanks for filling me in on the details! There might be a typo in one of your commands: ocountry rather than country.
> ocountry
Thanks for pointing this out! This was on purpose (--exclude ocountry) ;) It's so I could create a new country column and avoid sed 's/DRC/Democratic Republic of the Congo/g'. While DRC shouldn't be in the nucleotides section of a fasta file, I've seen stranger things.
Test run seems to have worked!
wget data.nextstrain.org/files/workflows/mpox/branch/add-inrb-with-permission/metadata.tsv.gz
wget data.nextstrain.org/files/workflows/mpox/branch/add-inrb-with-permission/sequences.fasta.xz
I'll merge then as it simplifies including the sequences in our builds. If there are outliers/issues, we can always simply exclude the accessions post-ingest, in the phylogenetic/nextclade workflows.
Nice work @j23414! I hope the old instructions were somewhat helpful? Please feel free to update it with any extra steps you had to take here.
> Thanks for pointing this out! This was on purpose (--exclude ocountry) ;) It's so I could create a new country column and avoid sed 's/DRC/Democratic Republic of the Congo/g'. While DRC shouldn't be in the nucleotides section of a fasta file, I've seen stranger things.
FYI, since all sources go through the ingest pipeline, you could have added this to the geolocation-rules.tsv as
Africa/DRC/*/* Africa/Democratic Republic of the Congo/*/*
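To illustrate what such a rule does, here is a rough sketch that mimics the rewrite on slash-separated region/country/division/location paths. It is not the ingest implementation: `apply_geo_rule` is a hypothetical helper, and treating `*` as "match anything and leave unchanged" is a simplifying assumption.

```shell
# Illustrative only: apply one geolocation-style rule to paths read from stdin.
# $1 = raw pattern, $2 = replacement; "*" matches any value and keeps it as-is.
apply_geo_rule() {
  awk -v raw="$1" -v new="$2" '
    BEGIN { FS = OFS = "/"; nr = split(raw, r, "/"); split(new, w, "/") }
    {
      # A line matches when it has the same number of fields and every
      # non-wildcard field of the raw pattern is identical.
      ok = (NF == nr)
      for (i = 1; ok && i <= nr; i++)
        if (r[i] != "*" && r[i] != $i) ok = 0
      # On a match, overwrite only the fields the replacement spells out.
      if (ok)
        for (i = 1; i <= nr; i++)
          if (w[i] != "*") $i = w[i]
      print
    }'
}
```

For example, `echo 'Africa/DRC/Sud-Kivu/Kamituga' | apply_geo_rule 'Africa/DRC/*/*' 'Africa/Democratic Republic of the Congo/*/*'` rewrites only the country field, while paths with another country pass through untouched.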
Description of proposed changes
Via Slack, it was requested that the new INRB data on mpox clade I from https://www.medrxiv.org/content/10.1101/2024.04.12.24305195v2 be added to our Nextstrain analysis. INRB is working on adding these records to NCBI, so this is a temporary solution similar to what has been done previously.
After obtaining permission to do so, this PR temporarily adds the records here to be included in the curated dataset. Temporary accessions run from TMP0000 to TMP0046, and authors is set to "INRB".
Please feel free to push further commits to this branch or suggest changes.