akograf closed this issue 10 months ago
Hi @akograf ! Thanks for the issue. Can you share examples that go wrong?
Hi Simon, Sorry for the late answer. Some examples of the error in parsing the country can be found below:
| name | accession | length | submission_date | host | country_original | country | countrycode |
|---|---|---|---|---|---|---|---|
| MF001519 | MF001519 | 12008 | 2017-05-09 | Homo sapiens | USA | Chad | TCD |
| KX352216 | KX352216 | 148 | 2017-07-31 | Homo sapiens | Venezuela | Senegal | SEN |
| KC810970 | KC810970 | 1317 | 2013-10-29 | Homo sapiens | South Korea | South Africa | ZAF |
I've also observed a second issue. Sometimes the script gets the wrong sampling date. For example, the sequence LC259094 was collected on 2016-05-11, but the script parses 2006-03-31. This happened with all the sequences below, which are all from the same batch.
| name | accession | length | submission_date | host | country_original | country | countrycode | sampling date (as parsed) |
|---|---|---|---|---|---|---|---|---|
| LC259094 | LC259094 | 11825 | 2017-12-22 | Homo sapiens | Angola | Angola | AGO | 2006-03-31 |
| LC259093 | LC259093 | 11832 | 2017-12-22 | Homo sapiens | Malaysia | Malaysia | MYS | 2006-03-31 |
| LC259092 | LC259092 | 12004 | 2017-12-22 | Homo sapiens | Cuba | Cuba | CUB | 2006-03-31 |
| LC259091 | LC259091 | 11980 | 2017-12-22 | Homo sapiens | Indonesia | Indonesia | IDN | 2006-03-31 |
| LC259090 | LC259090 | 12004 | 2017-12-22 | Homo sapiens | Colombia | Colombia | COL | 2006-03-31 |
| LC259089 | LC259089 | 12004 | 2017-12-22 | Homo sapiens | Dominica | Dominica | DMA | 2006-03-31 |
| LC259088 | LC259088 | 12007 | 2017-12-22 | Homo sapiens | Tonga | Tonga | TON | 2006-03-31 |
| LC259087 | LC259087 | 11980 | 2017-12-22 | Homo sapiens | Indonesia | Indonesia | IDN | 2006-03-31 |
| LC259086 | LC259086 | 12005 | 2017-12-22 | Homo sapiens | Indonesia | Indonesia | IDN | 2006-03-31 |
| LC259085 | LC259085 | 11980 | 2017-12-22 | Homo sapiens | Indonesia | Indonesia | IDN | 2006-03-31 |
| LC259084 | LC259084 | 11991 | 2017-12-22 | Homo sapiens | Philippines | Philippines | PHL | 2006-03-31 |
| LC259083 | LC259083 | 11998 | 2017-12-22 | Homo sapiens | Indonesia | Indonesia | IDN | 2006-03-31 |
Thanks for your attention,
@akograf @sdwfrost this script is cool!
The country code issue was simply due to "USA" being missing from the lookup array; similarly, "South Korea" was listed as "Korea (South)". Any further issues can be dealt with by editing countrycodes.h and rebuilding. I could not reproduce your issue with dates; it seemed to work fine for me.
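For anyone adding entries, here is a minimal sketch of what a name-to-code table with aliases could look like. The struct, array, and function names are hypothetical and are not taken from countrycodes.h, whose actual layout may differ; the point is only that each spelling seen in the records ("USA", "South Korea", "Korea (South)") gets its own row, and that an unknown name yields no code at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout for illustration; the real countrycodes.h may differ. */
struct country_entry {
    const char *name;  /* spelling as it may appear in a record */
    const char *code;  /* ISO 3166-1 alpha-3 code */
};

static const struct country_entry country_table[] = {
    { "Korea (South)", "KOR" },
    { "South Korea",   "KOR" },  /* alias for the row above */
    { "USA",           "USA" },
    { "United States", "USA" },  /* alias for the row above */
    { "Venezuela",     "VEN" },
    { "South Africa",  "ZAF" },
};

/* Return the code for an exact, case-sensitive name match,
 * or NULL when the name is not in the table. */
const char *lookup_country_code(const char *name)
{
    assert(name != NULL);
    size_t n = sizeof country_table / sizeof country_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(country_table[i].name, name) == 0)
            return country_table[i].code;
    }
    return NULL;
}
```

Returning NULL for a name that is not listed leaves the gap visible in the output, so a missing entry can be spotted and added to the table.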
I love this script, it is very useful! However, the piece of code that simplifies the original country of sampling has a bug and sometimes changes the country. For example, it turns USA into Chad, Venezuela into Senegal, South Korea into South Africa, etc.
Thanks for your attention, Simon.