sdwfrost / gbmunge

Munge GenBank files into FASTA and tab-separated metadata
MIT License

Errors when simplifying the original country data #2

Closed akograf closed 10 months ago

akograf commented 5 years ago

I love this script, it is very useful! However, the piece of code that simplifies the original country of sampling has a bug and sometimes changes the country. For example, it turns USA into Chad, Venezuela into Senegal, and South Korea into South Africa.

Thanks for your attention, Simon.

sdwfrost commented 5 years ago

Hi @akograf ! Thanks for the issue. Can you share examples that go wrong?

akograf commented 5 years ago

Hi Simon, sorry for the late answer. Some examples of the country parsing error are below:

name      accession  length  submission_date  host          country_original  country       countrycode
MF001519  MF001519   12008   2017-05-09       Homo sapiens  USA               Chad          TCD
KX352216  KX352216   148     2017-07-31       Homo sapiens  Venezuela         Senegal       SEN
KC810970  KC810970   1317    2013-10-29       Homo sapiens  South Korea       South Africa  ZAF

I've also observed a second issue. Sometimes the script gets the wrong sampling date. For example, the sequence LC259094 was collected on 2016-05-11, but the script parses it as 2006-03-31. This happened with all the sequences below, which are from the same batch.

LC259094  LC259094  11825  2017-12-22  Homo sapiens  Angola       Angola       AGO  2006-03-31
LC259093  LC259093  11832  2017-12-22  Homo sapiens  Malaysia     Malaysia     MYS  2006-03-31
LC259092  LC259092  12004  2017-12-22  Homo sapiens  Cuba         Cuba         CUB  2006-03-31
LC259091  LC259091  11980  2017-12-22  Homo sapiens  Indonesia    Indonesia    IDN  2006-03-31
LC259090  LC259090  12004  2017-12-22  Homo sapiens  Colombia     Colombia     COL  2006-03-31
LC259089  LC259089  12004  2017-12-22  Homo sapiens  Dominica     Dominica     DMA  2006-03-31
LC259088  LC259088  12007  2017-12-22  Homo sapiens  Tonga        Tonga        TON  2006-03-31
LC259087  LC259087  11980  2017-12-22  Homo sapiens  Indonesia    Indonesia    IDN  2006-03-31
LC259086  LC259086  12005  2017-12-22  Homo sapiens  Indonesia    Indonesia    IDN  2006-03-31
LC259085  LC259085  11980  2017-12-22  Homo sapiens  Indonesia    Indonesia    IDN  2006-03-31
LC259084  LC259084  11991  2017-12-22  Homo sapiens  Philippines  Philippines  PHL  2006-03-31
LC259083  LC259083  11998  2017-12-22  Homo sapiens  Indonesia    Indonesia    IDN  2006-03-31
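
One way to check a date mismatch like this is to pull the `/collection_date` qualifier straight from the raw GenBank record and compare it with the last column above. The following is only a minimal sketch for that check, not gbmunge's own parser; the function name `extract_collection_date` is hypothetical, and it assumes the qualifier and its quoted value sit on a single line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of gbmunge.
 * Copies the value of a /collection_date="..." qualifier found in `line`
 * into `out` (capacity `outlen`). Returns 1 on success, 0 otherwise. */
int extract_collection_date(const char *line, char *out, size_t outlen)
{
    static const char key[] = "/collection_date=\"";
    assert(line != NULL && out != NULL && outlen > 0);

    const char *start = strstr(line, key);
    if (start == NULL)
        return 0;                  /* qualifier not on this line */
    start += sizeof key - 1;       /* move past the opening quote */

    const char *end = strchr(start, '"');
    if (end == NULL)
        return 0;                  /* closing quote missing */

    size_t len = (size_t)(end - start);
    if (len >= outlen)
        return 0;                  /* value would not fit in `out` */

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Calling it on each line of the record for LC259094 and printing the result would show whether the source file itself carries 2016-05-11 or 2006-03-31.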

Thanks for your attention,

ghost commented 5 years ago

@akograf @sdwfrost this script is cool!

sdwfrost commented 10 months ago

The country code issue was simply due to 'USA' being missing from the lookup array; similarly, "South Korea" was listed as "Korea (South)". Any further issues can be dealt with by editing countrycodes.h and rebuilding. I could not reproduce your issue with dates - it seemed to work fine for me.
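
For anyone editing the table, here is a rough sketch of the idea, assuming a simple name-to-code array. The struct, the table contents, and `lookup_country_code` are illustrative only and are not the actual layout of countrycodes.h. It matches names exactly, lists alternative spellings as separate entries, and returns `NULL` for an unknown name so that a missing entry shows up as unmatched instead of as a different country.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table layout; the real countrycodes.h may differ. */
struct country_entry {
    const char *name;  /* spelling as it may appear in the record */
    const char *code;  /* ISO 3166-1 alpha-3 code */
};

static const struct country_entry COUNTRY_TABLE[] = {
    {"Chad",          "TCD"},
    {"Senegal",       "SEN"},
    {"South Africa",  "ZAF"},
    {"Korea (South)", "KOR"},
    {"South Korea",   "KOR"},  /* alias for the entry above */
    {"USA",           "USA"},
    {"United States", "USA"},  /* alias */
    {"Venezuela",     "VEN"},
};

/* Return the alpha-3 code for an exact name match, or NULL if unknown. */
const char *lookup_country_code(const char *name)
{
    assert(name != NULL);
    size_t n = sizeof COUNTRY_TABLE / sizeof COUNTRY_TABLE[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, COUNTRY_TABLE[i].name) == 0)
            return COUNTRY_TABLE[i].code;
    }
    return NULL;  /* caller decides how to report an unmatched name */
}
```

With this kind of table, "South Korea" and "Korea (South)" both resolve to KOR, and a name that is not listed returns `NULL` and can be logged for later addition.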

sdwfrost commented 10 months ago

Closed with adbead3