GISAID all-sequences fasta should be directly usable by nextstrain/ncov

brianpardy commented 4 years ago

Filing this as an issue as suggested by @emmahodcroft:

GISAID provides an all-sequences download button for SARS-CoV-2 sequences. The provided file is not directly usable as a sequences.fasta file in nextstrain/ncov because of several issues in the GISAID file:

1. There is at least one duplicate sequence name (Italy/INMI1/2020) that causes errors in augur filter, and often other duplicates exist before they are renamed on GISAID
2. There are several sequences with "Hong Kong" in their names that cause errors in augur filter due to the sequence name being truncated at whitespace
3. The sequence names are appended with the EPI_ISL identifier and a datestamp, which are not stripped when loaded and cause mismatches with sequence names in metadata.tsv
4. The sequence names are prepended with 'BetaCoV' or 'BetaCov' which is not stripped when loaded and causes mismatches with metadata.tsv

I suggest a new bash script in ncov/scripts/ that would optionally normalize the GISAID all-sequences download file so that users can use it directly without a need to manually remove duplicates or edit sequence names or maintain their own automated pipeline to generate data/sequences.fasta.
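
As a rough illustration of the name clean-up described above, here is a minimal Python sketch. It is not the proposed bash script. It assumes headers shaped like `>BetaCoV/<strain>|EPI_ISL_<id>|<date>`; the exact GISAID header layout, the function name `normalize_name`, and the example strain and accession are hypothetical. Dropping the space (rather than replacing it with an underscore) follows the observation further down the thread about how "Hong Kong" appears in metadata.tsv.

```python
import re

# Matches a leading 'BetaCoV/' or 'BetaCov/' prefix, whatever its casing.
_PREFIX = re.compile(r"^betacov/", re.IGNORECASE)


def normalize_name(header: str) -> str:
    """Turn one FASTA header line into a bare strain name.

    Assumed input shape (hypothetical): '>BetaCoV/<strain>|EPI_ISL_<id>|<date>'.
    """
    name = header.strip().lstrip(">")
    # Drop the appended EPI_ISL identifier and datestamp.
    name = name.split("|", 1)[0]
    # Drop the prepended 'BetaCoV/' or 'BetaCov/'.
    name = _PREFIX.sub("", name)
    # Remove embedded spaces so the name is not truncated at whitespace.
    return name.replace(" ", "")


if __name__ == "__main__":
    # Hypothetical header, for illustration only.
    print(normalize_name(">BetaCoV/Hong Kong/example1/2020|EPI_ISL_000001|2020-02-01"))
    # prints: HongKong/example1/2020
```

Duplicate names (the first issue in the list) still need separate handling after this step, since two distinct headers can normalize to the same strain name.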

I also suggest automating this in the Snakefile if possible, but I'm not sure how. If no data/sequences.fasta file exists, but data/gisaid_cov2020_sequences.fasta exists, run scripts/normalize-gisaid-fasta.sh before the rest of the pipeline.
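
The fallback described here could look roughly like the following Python sketch. It only illustrates the decision logic and is not a Snakefile rule; the function name `ensure_sequences`, its return values, and the injectable `run` parameter are assumptions made for the example, and the script path is the one proposed in the paragraph above.

```python
import subprocess
from pathlib import Path


def ensure_sequences(data_dir="data",
                     script="scripts/normalize-gisaid-fasta.sh",
                     run=subprocess.run):
    """Create sequences.fasta from the GISAID download when it is missing.

    Returns 'present', 'normalized', or 'missing' (hypothetical labels).
    """
    data = Path(data_dir)
    target = data / "sequences.fasta"
    source = data / "gisaid_cov2020_sequences.fasta"

    if target.exists():
        # Nothing to do: the pipeline input is already there.
        return "present"
    if source.exists():
        # Run the normalization script: <script> <input> <output>.
        run([script, str(source), str(target)], check=True)
        return "normalized"
    # Neither file exists; the caller has to fetch data some other way.
    return "missing"
```

Passing `run` as a parameter keeps the sketch testable without the real script being installed.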

I can likely have a base script written to do the normalization by this evening if there is interest.

Thank you for considering this idea.

emmahodcroft commented 4 years ago

Raised as an idea after issue #52 and the thought that more such error issues may come our way from those using the file download.

emmahodcroft commented 4 years ago

Thanks @brianpardy , I've linked to this on our internal Nextstrain convos so we can consider :)

brianpardy commented 4 years ago

Hi @emmahodcroft, I did go ahead and create a simple script that works on my local install using the current gisaid_cov2020_sequences.fasta file. I committed it to my fork with https://github.com/brianpardy/ncov/commit/b4010511c6eb3919ff917a7fd93b5127d248c412

It uses only cat, sed, awk, and grep. Call as: scripts/normalize_gisaid_fasta.sh data/gisaid_cov2020_sequences.fasta data/sequences.fasta

brianpardy commented 4 years ago

It is certainly the wrong way to do it but I also added a Snakefile rule called 'gisaid' that will run this script to create sequences.fasta from gisaid_cov2020_sequences.fasta. I don't know enough to change the Snakefile to replace the download rule with the gisaid rule, but calling "snakemake gisaid" on my copy will generate sequences.fasta, and "snakemake -f gisaid" will regenerate it when a new download from gisaid is placed in data/.

emmahodcroft commented 4 years ago

Thanks @brianpardy ! We are looking into some possible solutions here. We are going to try and make this work better, but we'll need to iron out some details on how to organise that :)

brianpardy commented 4 years ago

That sounds good to me, @emmahodcroft; if I helped spur some thought, I'm happy. I did make one more change to my script for the embedded-spaces item: I noticed the "Hong Kong" sequences were in metadata.tsv with the space removed, not converted to an underscore, so they could not be matched. I fixed that. I'll also add some error checking for calls that do not name the files on the command line. If the team elects to use this, great; if not, I still appreciate having the issue considered.

brianpardy commented 4 years ago

As @jameshadfield mentioned this issue in https://github.com/nextstrain/ncov/pull/57 I should add that as of right now the script I offered does not work perfectly with the current all sequences file from GISAID: the three new Hong Kong sequences EPI_ISL_412028, EPI_ISL_412029, and EPI_ISL_412030 have duplicate strain names to earlier submissions EPI_ISL_408975, EPI_ISL_409020, and EPI_ISL_409024. The awk statement in my normalize_gisaid_fasta script keeps only the first instance of a duplicate strain name and discards all additional instances. When run, my script will currently only keep the earlier, partial Spike glycoprotein sequences and will discard the newer, complete genomes. For the moment I am manually removing those three partial sequences from the GISAID download before running my script.

I wanted to keep the script simple and obvious, but it could probably be extended to keep the longest sequence found instead of the first, at the expense of readability and simplicity.
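
For comparison, a "keep the longest" variant might look like this in Python. This is a sketch of the idea only, not the awk statement in the script; `parse_fasta` and `keep_longest` are hypothetical helper names, and the input is assumed to be plain FASTA text with names already normalized.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def keep_longest(records):
    """Map each name to its longest sequence; ties keep the earlier record."""
    best = {}
    for name, seq in records:
        if name not in best or len(seq) > len(best[name]):
            best[name] = seq
    return best


if __name__ == "__main__":
    fasta = [">A", "ACGT", ">B", "AC", "GT", ">A", "ACGTACGT"]
    for name, seq in keep_longest(parse_fasta(fasta)).items():
        print(name, len(seq))
```

The trade-off is that every sequence has to be held in memory until the whole file has been read, whereas keeping the first occurrence can stream.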

brianpardy commented 4 years ago

Sorry about those extra commits showing up on the issue log, I'm learning how to deal with branching properly so I can submit a pull request and I was not expecting that quite yet. Please ignore the first one.

I updated my script to resolve this issue. I added a 3rd commandline parameter for minimum length that defaults to 15000. I am calling it from my Snakefile using params.min_length and it is working fine. This resolves, for now, the problem of normalize_gisaid_fasta.sh keeping the first appearing, shorter sequence, instead of the later appearing, complete sequence, when sequence names collide.
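
The combined behaviour can be sketched in Python as follows. This is an assumption about the order of operations (length filter first, then first-occurrence de-duplication), inferred from the description above rather than taken from the script itself; `filter_and_dedupe` is a hypothetical name, and 15000 is the default mentioned above.

```python
def filter_and_dedupe(records, min_length=15000):
    """Yield (name, sequence) pairs, skipping short and repeated entries.

    Sequences shorter than min_length are dropped before duplicates are
    checked, so a short early record cannot shadow a later complete genome.
    """
    seen = set()
    for name, seq in records:
        if len(seq) < min_length:
            continue  # e.g. a partial Spike glycoprotein sequence
        if name in seen:
            continue  # keep only the first sufficiently long record
        seen.add(name)
        yield name, seq
```

Note that this still keeps the first of two complete genomes that share a name; it only addresses the partial-versus-complete collision.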

I set this up on a clean branch on my fork that should merge cleanly if the team accepts the pull request I am about to submit for commit https://github.com/brianpardy/ncov/commit/d3c90c751696a089e85538825639bc2be131b4b4

No offense taken if unwanted.

emmahodcroft commented 4 years ago

Hi @brianpardy , thank you for the work! Yes, these are the same issues we are running into on our end. We're still trying to figure out the best way to deal with this both for public users and for our own internal builds (which need to be aligned between all of us who update Nextstrain, etc, so are a bit more complicated). We're all a bit short on time at the moment unfortunately, so progress is slow - sorry!

xzhuo commented 4 years ago

It may sound silly but I have to ask: where is the "all-sequences download button for SARS-CoV-2 sequences" in GISAID? I could not find it...

emmahodcroft commented 4 years ago

On GISAID, in the EpiCoV tab - bottom right 'Download' button, under the table.

emmahodcroft commented 4 years ago

You will need a GISAID account to do this.

xzhuo commented 4 years ago

I registered. I can see each entry with a "download metadata" and a "download fasta" button. But I could not find a button to download all of them.

brianpardy commented 4 years ago

You need to be on the main 'browse' screen that lists all of the deposited sequences, not the individual-sample screen that contains the 'download metadata' and 'download fasta' buttons. The button is just labeled "Download" with an icon on it, to the right of the screen paging tools.

xzhuo commented 4 years ago

Thank you very much! Do you mean the Excel table? I can download an Excel table with all the entries by clicking "Download Acknowledgement Table for all submissions here". But I still cannot get a fasta file...

brianpardy commented 4 years ago

It looks like the download button is not appearing for you at all. Are you able to scroll your screen to the right? The page I see includes a download button as shown below.

xzhuo commented 4 years ago

No, I don't have that button. Thank you both very much for replying! Now I have to try something else.

melkebir commented 4 years ago

@xzhuo: Same issue for me; GISAID removed the download button. Did you figure out an alternative solution?

xzhuo commented 4 years ago

Not yet. A crawler?

pedroelbanquero commented 4 years ago

Why not add the FASTA files for SARS, MERS, and the others of the family? When is it a new virus?

wwydmanski commented 4 years ago

@pedroelbanquero sure, it can be done at some point. For now I've added parsing of the samples' metadata to the scraper; it should yield some interesting information

rvosa commented 4 years ago

I've been having trouble getting @wwydmanski's scraper to work (it errors at the end of the first page because the DOM seems to have changed). @melkebir's scraper does work. Both were run on a MacBook with macOS 10.14.6.

wwydmanski commented 4 years ago

@rvosa maybe it's OS dependent? I've tested it only on Windows 10

trvrb commented 4 years ago

I'll leave this issue open for discussion. If you successfully download gisaid_cov2020_sequences.fasta from GISAID then the merged #59 should make preparation of sequences.fasta straightforward. You can run

./scripts/normalize_gisaid_fasta.sh data/gisaid_cov2020_sequences.fasta data/sequences.fasta

and then just proceed with snakemake -p or nextstrain build. We've done additional curation on top of GISAID's but this is all visible in the metadata.tsv file.

I don't understand what's going on with some people being able to see a "Download" button and others not able to. I'd suggest to continue to contact GISAID support about this.

brianpardy commented 4 years ago

Thank you for the merge, @trvrb!

As another followup to this, users running local nextstrain/ncov instances based on the normalized GISAID fasta download may notice inconsistencies in their local results vs those on the nextstrain.org site. Occasionally sequences released on GISAID are later withdrawn or set as non-public, at which point they no longer appear in the gisaid_cov2020_sequences.fasta file provided by GISAID. Nextstrain itself appears to be using an independent archive that does not always immediately reflect the removal of sequences from GISAID (though it has in the past).

For example, the current GISAID download lacks many of the Guangdong sequences from March 9th currently visible on nextstrain.org/ncov.

tolot27 commented 4 years ago

I don't understand what's going on with some people being able to see a "Download" button and others not able to. I'd suggest to continue to contact GISAID support about this.

It looks like GISAID made some correction after contacting them via E-Mail. Now I see the Download button again. :smiley:

abitrolly commented 4 years ago

Is it possible to submit this validator script to GISAID to improve the data quality on their side?

pedroelbanquero commented 4 years ago

@pedroelbanquero sure, it can be done at some point. For now I've added parsing of the samples' metadata to the scraper; it should yield some interesting information

Looking at the proteins from the first SARS samples of 2004, they seem more similar to these samples than others from 2018 or 2019 do. How is that possible after 17 years of evolution, with such a mutable virus that changes a lot in 3 months?

I was using exonerate to compare the alignments.

./exonerate orf1ab2004.fasta covid

C4 Alignment:

     Query: AAP49011.4 orf1ab polyprotein [SARS coronavirus ZJ01]
    Target: NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
     Model: ungapped:protein2dna
 Raw score: 217

 Query range: 925 -> 989
Target range: 3034 -> 3226

926 : TyrProProAspGluGluGluGluAspAspAlaGluCysGluGluGluGluIleAspGluTh : 946 !:! !||| !!:|||!!:|||!!:!!:!.!!!:|||||||||||||||.!!!!: !:! PheTyrProProAspGluAspGluGluGluGlyAspCysGluGluGluGluPheGluProSe 3035 : TTCTACCCTCCAGATGAGGATGAAGAAGAAGGTGATTGTGAAGAAGAAGAGTTTGAGCCATC : 3095

947 : rCysGluHisGluTyrGlyThrGluAspAspTyrGlnGlyLeuProLeuGluPheGlyAlaS : 967 ! !:!!:!!|||||||||||||||||||||||||||||| !||||||||||||||||||! rThrGlnTyrGluTyrGlyThrGluAspAspTyrGlnGlyLysProLeuGluPheGlyAlaT 3096 : AACTCAATATGAGTATGGTACTGAAGATGATTACCAAGGTAAACCTTTGGAATTTGGTGCCA : 3158

968 : erAlaGluThrValArgValGluGluGluGluGluGluAspTrpLeuAspAspThrThrGlu : 987 !!:!!! .!!:!!!:! !|||||||||:!!||||||||||||||||||||| !!:!:!! hrSerAlaAlaLeuGlnProGluGluGluGlnGluGluAspTrpLeuAspAspAspSerGln 3159 : CTTCTGCTGCTCTTCAACCTGAAGAAGAGCAAGAAGAAGATTGGTTAGATGATGATAGTCAA : 3218

988 : GlnSer : 989 |||!!! GlnThr 3219 : CAAACT : 3226

...

3896 : LeuSerMetGlnGlyAlaValAspIleAsnArgLeuCysGluGluMetLeuAspAsnArg : 3915 ||||||||||||||||||||||||||||||!:!||||||||||||||||||||||||||| LeuSerMetGlnGlyAlaValAspIleAsnLysLeuCysGluGluMetLeuAspAsnArg 12018 : CTTTCCATGCAGGGTGCTGTAGACATAAACAAGCTTTGTGAAGAAATGCTGGACAACAGG : 12077

3916 : AlaThrLeuGlnAlaIleAlaSerGluPheSerSerLeuProSerTyrAlaAlaTyrAla : 3935 ||||||||||||||||||||||||||||||||||||||||||||||||||||||!:!||| AlaThrLeuGlnAlaIleAlaSerGluPheSerSerLeuProSerTyrAlaAlaPheAla 12078 : GCAACCTTACAAGCTATAGCCTCAGAGTTTAGTTCCCTTCCATCATATGCAGCTTTTGCT : 12137

3936 : ThrAlaGlnGluAlaTyrGluGlnAlaValAlaAsnGlyAspSerGluValValLeuLys : 3955 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ThrAlaGlnGluAlaTyrGluGlnAlaValAlaAsnGlyAspSerGluValValLeuLys 12138 : ACTGCTCAAGAAGCTTATGAGCAGGCTGTTGCTAATGGTGATTCTGAAGTTGTTCTTAAA : 12197

3956 : LysLeuLysLysSerLeuAsnValAlaLysSerGluPheAspArgAspAlaAlaMetGln : 3975 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| LysLeuLysLysSerLeuAsnValAlaLysSerGluPheAspArgAspAlaAlaMetGln 12198 : AAGTTGAAGAAGTCTTTGAATGTGGCTAAATCTGAATTTGACCGTGATGCAGCCATGCAA : 12257

3976 : ArgLysLeuGluLysMetAlaAspGlnAlaMetThrGlnMetTyrLysGlnAlaArgSer : 3995 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ArgLysLeuGluLysMetAlaAspGlnAlaMetThrGlnMetTyrLysGlnAlaArgSer 12258 : CGTAAGTTGGAAAAGATGGCTGATCAAGCTATGACCCAAATGTATAAACAGGCTAGATCT : 12317

3996 : GluAspLysArgAlaLysValThrSerAlaMetGlnThrMetLeuPheThrMetLeuArg : 4015 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| GluAspLysArgAlaLysValThrSerAlaMetGlnThrMetLeuPheThrMetLeuArg 12318 : GAGGACAAGAGGGCAAAAGTTACTAGTGCTATGCAGACAATGCTTTTCACTATGCTTAGA : 12377

The 2004 SARS protein is more similar to COVID-19 than the 2018 SARS one is. I think it would be interesting to add SARS and MERS data, everything from the family. I was using data from https://www.ncbi.nlm.nih.gov/, which has a lot of fasta files.

xzhuo commented 4 years ago

If you are interested, please check out https://nextstrain.org/groups/blab/beta-cov and https://nextstrain.org/groups/blab/sars-like-cov. I don't know which 2018 SARS you are talking about, but you are more related to your aunt/uncle than to your cousin even though you and your cousin are about the same age. Does that make sense? Besides, if you want to discuss evolutionary analysis of SARS-CoV-2, http://virological.org/ is probably a better platform...

pedroelbanquero commented 4 years ago

I was looking at these examples, but they seem to contain just 52 genome samples. The protein I'm talking about, the one more similar than the others, is AAP49011.4 from 2004: https://www.ncbi.nlm.nih.gov/protein/AAP49011.4. I think more data needs to be added to the samples.

The 2018 sample is:

./exonerate betacorona2018.fasta orf1ab2004.fasta
Command line: [./exonerate betacorona2018.fasta orf1ab2004.fasta]
Hostname: [syudaiwqmdx.znmoczi]

C4 Alignment:

     Query: NC_019843.3 Middle East respiratory syndrome coronavirus, complete genome
    Target: AAP49011.4 orf1ab polyprotein [SARS coronavirus ZJ01]
     Model: ungapped:dna2protein
 Raw score: 5554

 Query range: 13816 -> 17983
Target range: 4504 -> 5893

13817 : GATCAAAATAGCGAAGTGCTTAAGGCTATCTTAGTGAAGTATGGTTGCTGTGATGTTACC : 13874 AspGlnAsnSerGluValLeuLysAlaIleLeuValLysTyrGlyCysCysAspValThr !!: !||| !!!!:..!||||||! |||||||||! !|||..!|||||||||! ! ! 4505 : GluGlyAsnCysAspThrLeuLysGluIleLeuValThrTyrAsnCysCysAspAspAsp : 4524

13875 : TACTTTGAAAATAAACTCTGGTTTGATTTTGTTGAAAATCCCAGTGTTATTGGTGTTTAT : 13934 TyrPheGluAsnLysLeuTrpPheAspPheValGluAsnProSerValIleGlyValTyr ||||||.!.!!.||| !|||!:!||||||||||||||||||..!:!!:!! !!|||||| 4525 : TyrPheAsnLysLysAspTrpTyrAspPheValGluAsnProAspIleLeuArgValTyr : 4544

13935 : CATAAACTTGGAGAACGTGTACGCCAAGCTATCTTAAACACTGTTAAATTTTGTGACCAC : 13994 HisLysLeuGlyGluArgValArgGlnAlaIleLeuAsnThrValLysPheCysAspHis !!!.|||||||||||||||||||||:!!:!!|||!!.||||||:!!||||||||| ! 4545 : AlaAsnLeuGlyGluArgValArgGlnSerLeuLeuLysThrValGlnPheCysAspAla : 4564

13995 : ATGGTCAAGGCTGGTTTAGTCGGTGTGCTCACACTAGACAACCAGGACCTTAATGGCAAG : 14054

This looks more different than the new COVID-19 does: COVID-19 is more similar to the 2004 protein than an evolved SARS version is. The new COVID-19 has 100% matches with the 2004 protein, and the evolved versions do not.

Why?

Thanks for the link, but I can't see where I can register.

larsonreever commented 4 years ago

This will help us a long way, as we are coming up with a community-based solution to track coronavirus near you. The initiative, named Corona Warriors, will be an innovative step ahead to help the spread. GitHub has lots of good resources which we can leverage. Contributions & partnerships are welcome.

pedroelbanquero commented 4 years ago

gisaid.org does not seem to provide access to the general public. Is the data available anywhere else?

babarlelephant commented 4 years ago

@pedroelbanquero At the bottom of https://github.com/nextstrain/ncov there is a link to a JSON file containing all the mutations. I made a script to transform the tree of mutations into a list (the problem is that the tree represents the "time" one on nextstrain, not the "divergence" one)

@brianpardy Would you mind checking on GISAID (I can't) whether they updated the sequence EPI_ISL_414628 within the last 2 days? There was a contiguous 12-nucleotide mutation in the 3' UTR, and I would like to know if I should expect the sequences to be corrected every 3 days (if so, maybe add a warning on nextstrain or twitter?). On 19 March it was as follows: "nuc": ["A24956G", "G29701A", "G29702A", "G29703A", "G29705A", "G29706A", "T29709A", "T29710C", "G29711A", "G29715A", "G29717A", "C29718A", "C29719A"]. Now it is the same as the original: AGGGAGGACTTGAAAGAGCCA

brianpardy commented 4 years ago

Hi @acx01b, unfortunately I don't have an exhaustive list of which sequences have been revised when, but there have been multiple cases in the past where samples with apparent sequencing errors have been revised at various times since they were originally deposited, not specifically 3 days later but just as needed, it appears.

If you run the scripts/get-data.sh script contained in the nextstrain/auspice distribution, you can retrieve the ncov.json files as used on nextstrain.org over previous days and you may be able to find what you are looking for in the data generated there.

TrentBrick commented 4 years ago

After not being able to download GISAID data and trying both of the web scrapers linked to on this thread, I emailed GISAID about the problem and the download button has now appeared for me. If the button doesn't appear for you (bottom right, you can't miss it) then just email them and say why you need to be able to download all of the data.

trvrb commented 4 years ago

Hi everyone,

Could you please refrain from posting links to web scrapers of GISAID? These scrapers are harmful to the functioning of GISAID. I'm going to specifically delete comments that include these scrapers.

Thank you all.

abitrolly commented 4 years ago

@trvrb scrapers still require a valid GISAID account. Could you please explain why they are harmful? If GISAID needs a specialist to help them with managing server load, I can help.

ZeweiSong commented 4 years ago

I cannot find the download-all button as of 2020/3/25; maybe they just removed it? The only way I can check the sequences is by browsing, but that means downloading one record at a time.

Does anyone else have the same problem?

rvosa commented 4 years ago

The sequence availability issue is something that is problematic beyond nextstrain per se. Perhaps it makes sense if someone from the nextstrain core kept an eye on the activities towards data sharing that are being developed by the participants of the covid-19 biohackathon.

TrentBrick commented 4 years ago

ZeweiSong, you need to email them. They should enable it for you then. Not at all clear why this is the case -- super frustrating in fact. But this is what happened for me. (also don't expect them to email you back, just check again 24 hours later and see if the button appears).

melkebir commented 4 years ago

Seconding @TrentBrick and @trvrb messages -- best way forward is to contact GISAID and request access. Please do not use scrapers -- with increasing number of sequences and number of interested users this would essentially amount to a denial of service.

victorlin commented 4 years ago

Not sure why my comment was removed - calling the Javascript function that triggers the download should be just as costly as using the download button itself. You would still need access to the page in the first place.

rvosa commented 4 years ago

I'm gearing up to formulate a request for data access and sharing on behalf of the biohackathon (there's a special covid-19 edition starting soon). I asked GISAID on twitter but I don't think they're very active there. I've had some interaction via their issue tracker so I'll next try in that way.

Would it make sense to ask on behalf of (or with reference to) the nextstrain user community at the same time? Please let me know if I should do that.

The general idea is not to nag or complain. I'm sure they're very busy right now. Also, I imagine they are simply under existing agreements with data submitters that they have to comply with. However, maybe there are other ways in which they can meet their obligations and still accomplish data access with less friction. That will probably involve both technical implementation and social busywork. It seems to me that there are many people willing and able to help with both of these right now.

Something structural needs to improve that we mustn't try to address with screen scrapers and javascript backdoors. More and more researchers want to do good work with these data. It is part of GISAID's stated mission to enable that. We ought to work together to make that possible in an open and collaborative way.

abitrolly commented 4 years ago

@rvosa maybe they (GISAID) think that the data will be used in a malicious way? Because if not, then maybe there is insufficient funding and technical capacity to avoid DoS, e.g. by setting up memcache.

palatos commented 4 years ago

Is anyone else having trouble accessing GISAID right now? It was hard for me to create an account, but now that I have one the ncov tab just doesn't load. I'm not sure why it's so hard to obtain the sequences. Makes analyzing the data so much harder compared to the ones deposited in genbank.

vscooper commented 4 years ago

No trouble creating an account, but download requests keep throwing errors.

oneillkza commented 4 years ago

@palatos it's been up and down for me. Just keep re-trying. Fortunately the actual fasta download is pretty small and quick once you get in.

woson2020 commented 4 years ago

@brianpardy I can't see the button for downloading all genome sequences. How can I figure it out?

woson2020 commented 4 years ago

@xzhuo Are you able to download all genome sequences of ncov?

canholyavkin commented 4 years ago

@brianpardy I can't see the button for downloading all genome sequences. How can I figure it out?

@woson2020, as suggested above, you'll need to request download access from GISAID. After you log in, you can send a message through the Contact page. They generally grant access within 1-2 days.

nextstrain / ncov

GISAID all-sequences fasta should be directly usable by nextstrain/ncov #53
