outbreak-info / outbreak.info

During outbreaks of emerging diseases such as COVID-19, efficiently collecting, sharing, and integrating data is critical to scientific research. outbreak.info is a resource to aggregate all this information into a single location.
https://outbreak.info/
GNU General Public License v3.0

Same variants reported without aliases and with aliases #618

Closed 3dgiordano closed 1 year ago

3dgiordano commented 1 year ago

Hello everyone.

I've been using outbreak.info for a while, and I noticed that some variants are listed and reported under their full (unaliased) Pango name, and that those records are not counted in the statistics of the correct variant, which is the aliased name.

As I understand it, a variant should always be designated by its alias, and the alias and the unaliased name refer to the same variant, so the two should not show different statistics.

Here is an example: B.1.1.529.5.2.1 and BA.5.2.1 (the alias of B.1.1.529.5.2.1).

However, when I query statistics, sequence comparisons, or countries, most of the data is under B.1.1.529.5.2.1 and not under its alias BA.5.2.1. I can't tell whether the data under BA.5.2.1 covers the same sequences. It appears to be different, which suggests that a step converting the unaliased name to the aliased one was missed.
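To make the missing step concrete, here is a minimal sketch of the kind of conversion I mean. It is not outbreak.info's actual pipeline. The `ALIASES` table is a small hand-written subset: only the `BA` entry comes from the example above, and the `BE` and `BF` entries are assumptions that should be checked against the pango-designation alias key. The function names and the counts in the usage note are hypothetical.

```python
# Illustrative subset of alias prefixes. "BA" is taken from the example above;
# "BE" and "BF" are assumptions to verify against the pango-designation alias key.
ALIASES = {
    "BA": "B.1.1.529",
    "BE": "B.1.1.529.5.3.1",
    "BF": "B.1.1.529.5.2.1",
}


def to_alias(lineage, aliases=ALIASES):
    """Return the aliased form of a fully expanded lineage name."""
    best_alias, best_prefix = None, ""
    for alias, prefix in aliases.items():
        # The prefix must be followed by at least one more numeric level,
        # so B.1.1.529.5.2.1 becomes BA.5.2.1 rather than a bare "BF".
        if lineage.startswith(prefix + ".") and len(prefix) > len(best_prefix):
            best_alias, best_prefix = alias, prefix
    if best_alias is None:
        return lineage  # already aliased, or no alias applies
    return best_alias + lineage[len(best_prefix):]


def merge_counts(counts, aliases=ALIASES):
    """Sum sequence counts so each variant is reported once, under its alias."""
    merged = {}
    for name, n in counts.items():
        key = to_alias(name, aliases)
        merged[key] = merged.get(key, 0) + n
    return merged
```

With made-up counts, `merge_counts({"B.1.1.529.5.2.1": 10, "BA.5.2.1": 3})` would report a single BA.5.2.1 entry holding both totals, which is the behavior I would expect from the site.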

This is easy to see in the variant comparison: the same variant appears twice, with different sequence totals. The countries reporting B.1.1.529.5.2.1 sequences are also not the same as those reporting BA.5.2.1, even though both names refer to the same variant. https://outbreak.info/compare-lineages?pango=B.1.1.529.5.2.1&pango=BA.5.2.1&gene=ORF1a&gene=ORF1b&gene=S&gene=ORF3a&gene=E&gene=M&gene=ORF6&gene=ORF7a&gene=ORF7b&gene=ORF8&gene=N&threshold=75&nthresh=1&sub=false&dark=false

Perhaps something is going wrong in the process that migrates variant IDs to aliases and then corrects or summarizes the data. Since this could affect the statistics, I wanted to make sure it was reported.

Regards

3dgiordano commented 1 year ago

I analyzed your data. The following names are the suspects: they appear in the database unaliased, but they should be stored in alias format:

'B.1.1.529.1.1.1' 'B.1.1.529.1.17.2' 'B.1.1.529.2.10.1' 'B.1.1.529.2.12.1' 'B.1.1.529.2.3.16' 'B.1.1.529.2.3.2' 'B.1.1.529.2.3.20' 'B.1.1.529.2.3.21' 'B.1.1.529.2.38.3' 'B.1.1.529.2.75.1' 'B.1.1.529.2.75.2' 'B.1.1.529.2.75.3' 'B.1.1.529.2.75.3.1.1.1' 'B.1.1.529.2.75.3.1.1.3' 'B.1.1.529.2.75.3.4.1.1' 'B.1.1.529.2.75.4' 'B.1.1.529.2.75.5' 'B.1.1.529.2.75.6' 'B.1.1.529.2.75.9' 'B.1.1.529.4.1.10' 'B.1.1.529.4.6.5' 'B.1.1.529.5.1.10' 'B.1.1.529.5.1.15' 'B.1.1.529.5.1.21' 'B.1.1.529.5.1.22' 'B.1.1.529.5.1.23' 'B.1.1.529.5.1.25' 'B.1.1.529.5.1.26' 'B.1.1.529.5.1.29' 'B.1.1.529.5.10.1' 'B.1.1.529.5.2.1' 'B.1.1.529.5.2.16' 'B.1.1.529.5.2.18' 'B.1.1.529.5.2.20' 'B.1.1.529.5.2.21' 'B.1.1.529.5.2.24' 'B.1.1.529.5.2.24.2.1.1' 'B.1.1.529.5.2.25' 'B.1.1.529.5.2.26' 'B.1.1.529.5.2.27' 'B.1.1.529.5.2.3' 'B.1.1.529.5.2.31' 'B.1.1.529.5.2.33' 'B.1.1.529.5.2.36' 'B.1.1.529.5.2.38' 'B.1.1.529.5.2.6' 'B.1.1.529.5.2.7' 'B.1.1.529.5.3.1' 'B.1.1.529.5.3.1.1.1.1' 'B.1.1.529.5.3.1.1.1.1.1.1.1' 'B.1.1.529.5.3.1.1.1.1.1.1.14' 'B.1.1.529.5.3.1.1.1.1.1.1.15' 'B.1.1.529.5.3.1.1.1.1.1.1.3' 'B.1.1.529.5.3.1.1.1.1.1.1.5' 'B.1.1.529.5.3.1.1.1.1.1.1.7' 'B.1.1.529.5.3.1.1.1.1.1.1.8' 'B.1.1.529.5.3.1.1.1.2' 'B.1.1.529.5.3.1.4.1.1' 'B.1.1.529.5.6.2'
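A quick way to flag names like these is to count the numeric levels. Pango names carry at most three numeric levels after the letter prefix, and anything deeper is expected to be written with an alias. The sketch below applies only that rule. The function names are hypothetical, and this is not necessarily how the list above was produced.

```python
def needs_alias(lineage, max_levels=3):
    """True if the name has more numeric levels than a Pango name allows."""
    parts = lineage.split(".")
    numeric_levels = parts[1:]  # everything after the letter prefix
    return len(numeric_levels) > max_levels


def find_unaliased(names):
    """Return the names that should have been stored under an alias."""
    return [name for name in names if needs_alias(name)]
```

For example, `find_unaliased(["B.1.1.529.5.2.1", "BA.5.2.1", "B.1.1.529"])` keeps only the first name, because it has six numeric levels.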

I hope this helps.

For the analysis I used a table I maintain that maps each Pango ID to its alias: https://github.com/3dgiordano/SARS-CoV-2-Variants/blob/main/data/pango.csv The table is generated from https://github.com/cov-lineages/pango-designation/blob/master/lineage_notes.txt
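As a rough illustration of how such a table can be derived, the sketch below scans notes text for entries whose description starts with "Alias of". It assumes each line has the form `lineage<TAB>description`, which is my reading of the file layout and should be checked against the current file. The function name `parse_alias_table` is hypothetical, and this is not the exact script behind pango.csv.

```python
import re

# Matches descriptions such as "Alias of B.1.1.529.5.2.1, ..." and captures the full name.
ALIAS_RE = re.compile(r"^Alias of ([A-Z]+(?:\.\d+)*)")


def parse_alias_table(notes_text):
    """Map each unaliased name to its alias, based on 'Alias of' descriptions."""
    table = {}
    for line in notes_text.splitlines():
        name, sep, description = line.partition("\t")
        if not sep:
            continue  # skip lines that do not have a tab-separated description
        match = ALIAS_RE.match(description.strip())
        if match:
            table[match.group(1)] = name.strip()
    return table
```

The resulting dictionary can then be used to look up the alias for any unaliased name found in the database.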

Regards

3dgiordano commented 1 year ago

I just noticed that the data problem has been fixed in production. I wanted to let you know, because if you now check the example I posted, you will no longer see what I was describing. The statistics have been merged: B.1.1.529.5.2.1 and BA.5.2.1 now show the same statistical information, and the comparison displays them with the same data.

I am leaving the issue open so that it can continue to be analyzed.

flaneuse commented 1 year ago

Thanks @3dgiordano for raising this issue. As you noted, our latest data release fixed it. We're still investigating the cause of the aliasing problem you noticed. Let us know if you run into other issues. Thanks!

3dgiordano commented 1 year ago

Hi @flaneuse, I have been checking the data over the last two weeks and the problem has not happened again. I don't know whether you identified the cause, but I see no problem in closing this issue if it is not being followed up. If I find any inconsistencies again, I will report them. I am closing the issue. If you think it needs to stay open, please reopen it.