roblanf / sarscov2phylo

Global phylogenies of SARS-CoV-2 sequences
GNU General Public License v3.0

masked alignment different from global tree in GISAID #21

Open lpipes opened 2 years ago

lpipes commented 2 years ago

Hello, I tried to download the masked alignment from GISAID, but it contains >3 million sequences, while the global tree they uploaded has only ~600K. Do you know where I can download the MSA file for the most recent global tree? Thanks.

roblanf commented 2 years ago

Hi @lpipes, there are two parts to this answer. First, the most recent global tree contains almost 3M sequences, although that's still fewer than in the alignment. The reason for the discrepancy is that the alignment contains all sequences, but the tree is built only with those that are good enough to build a tree from.

The older trees had only 600K sequences, because that's all FastTree could handle. They were subsampled to include all of the most recent sequences, plus something like 100K other sequences for context.

In both cases, the way to get an alignment that has only the sequences contained in the tree is to pull out of the alignment just the sequences you want. To do that, I'd:

  1. Make a text file of all the sequence names in the tree (one per line)
  2. Extract the corresponding sequences from the alignment using faSomeRecords
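
The two steps above can be sketched in shell roughly as follows. This is only a sketch under some assumptions, not necessarily how the trees in this repo are handled: the tree is assumed to be a Newick file written on a single line with unquoted tip labels, `faSomeRecords` (one of the UCSC command-line utilities) is assumed to be on your PATH, and the function names are made up for illustration. The names in the list also need to match the record names in the alignment exactly.

```shell
# Sketch only: function names are hypothetical, and the tip-name extraction
# assumes a single-line Newick tree with unquoted labels.

# Step 1: print one tip name per line from a Newick file.
# Tip labels directly follow "(" or ","; labels that follow ")" are
# internal node labels or support values, so they are skipped.
tree_tip_names() {
    grep -oE '[(,][^(),:;]+' "$1" | sed -E 's/^[(,]//'
}

# Step 2: write only the alignment records whose names are tips in the tree.
# Usage: subset_alignment TREE ALIGNMENT OUTPUT
subset_alignment() {
    local tree="$1" alignment="$2" out="$3"
    tree_tip_names "$tree" > "${out}.names.txt"
    # faSomeRecords argument order: in.fa listFile out.fa
    faSomeRecords "$alignment" "${out}.names.txt" "$out"
}
```

For example, `subset_alignment global.tree alignment.fa tree_only.fa` (all three file names are placeholders) would leave the list of names in `tree_only.fa.names.txt` and the reduced alignment in `tree_only.fa`. Comparing `wc -l` on the names file with `grep -c '>'` on the output is a quick way to see whether every tip was found in the alignment.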

Hope that helps!

Rob

lpipes commented 2 years ago

Hi Rob,

Thanks for your explanation. The tree I recently downloaded (dated 2021-09-26) only had ~600K sequences in it, but the most recent tree (dated 2021-10-05), which I just downloaded, has ~3 million. Using faSomeRecords makes sense, but I am actually having a lot of trouble extracting the MSA from the tar file.

tar xf mmsa_2021-10-06.tar.xz
xz: (stdin): Unexpected end of input
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

I also encountered this error with the previous *tar.xz files that were posted. Any idea on what could be the problem?

-Lenore

roblanf commented 2 years ago

Huh, that's odd. I would have done the same, like:

tar -xf alignment.tar.xz

I'll have to take a look next week. But you could try doing the xz first, like:

xz -d alignment.tar.xz

then un-tarring it after that.
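
Before decompressing, it can also be worth testing the file itself. An "Unexpected end of input" message from xz usually means the compressed data stops before its end, which is what an interrupted or incomplete download looks like. The sketch below runs `xz -t` (xz's built-in integrity test) first and only then does the two-step decompress and un-tar. The function names are made up for illustration, and this is just one way to do it.

```shell
# Sketch only: function names are hypothetical.

# Return 0 if the .xz file passes xz's built-in integrity test.
check_xz() {
    xz -t "$1"
}

# Test the archive, then decompress and un-tar as two separate steps.
# Usage: unpack_tar_xz ARCHIVE.tar.xz
unpack_tar_xz() {
    local archive="$1"
    if ! check_xz "$archive"; then
        echo "$archive failed the xz integrity test; it may be truncated, so try downloading it again" >&2
        return 1
    fi
    xz -dk "$archive"           # -d decompresses, -k keeps the original .tar.xz
    tar -xf "${archive%.xz}"    # un-tar the resulting .tar into the current directory
}
```

For example, `unpack_tar_xz mmsa_2021-10-06.tar.xz`. If the integrity test fails on a fresh download, comparing the size of the local file with the size shown on the download page (if one is listed) would tell you whether the download was cut short or the posted file itself is incomplete.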

-- Rob Lanfear Division of Ecology and Evolution, Research School of Biology, The Australian National University, Canberra

www.robertlanfear.com

lpipes commented 2 years ago

Hmm, seems like that doesn't work either, ugh...

xz -d mmsa_2021-10-06.tar.xz
xz: mmsa_2021-10-06.tar.xz: Unexpected end of input

lpipes commented 2 years ago

In fact, I've tried to extract every single MSA file that they have posted, and all of them fail with "Unexpected EOF in archive". I've sent them a message, though.