refgenie / plantref

Refgenieserver content repository for plant genomes server
http://plantref.databio.org

Duplicate Arabidopsis_lyrata #10

Closed by nsheff 3 years ago

nsheff commented 3 years ago

Related to #8

I think these two are identical sequences, with different wrapping:

Arabidopsis_lyrata_JGI_v1_0-fasta-fasta Arabidopsis_lyrata__JGI_v2_1-fasta-fasta

@ieguinoa how do you want to proceed?

nsheff commented 3 years ago

Confirmed. When I wrap them both at the same column width, they give the same checksum:

# Drop the header line, unwrap the sequence, and re-wrap both files at 50 columns
sed 1d z1.fa | tr -d '\n' | fold -w 50 -s > z1_wrap.fa
sed 1d z2.fa | tr -d '\n' | fold -w 50 -s > z2_wrap.fa
md5sum z1_wrap.fa 
6e57c10072f0b6bed3460b17ef2c9b87  z1_wrap.fa
md5sum z2_wrap.fa 
6e57c10072f0b6bed3460b17ef2c9b87  z2_wrap.fa

Which should we remove?
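The re-wrapping check above can also be done without intermediate files. The following is a minimal sketch, not the method used in this thread: `seq_md5` is a hypothetical helper name, and it assumes GNU `md5sum` is available. It drops every header line, so it also works on multi-record FASTA files, but as a consequence it ignores record names and record boundaries and compares only the concatenated sequence. It does not normalize case.

```bash
# seq_md5 FILE (hypothetical helper): print the MD5 of the sequence content only.
# Header lines (starting with '>') and all line breaks are removed first,
# so two files that differ only in line wrapping or headers give the same value.
seq_md5() {
    grep -v '^>' "$1" | tr -d '\r\n' | md5sum | cut -d ' ' -f 1
}

# Example usage with the two files from this thread:
#   [ "$(seq_md5 z1.fa)" = "$(seq_md5 z2.fa)" ] && echo "identical sequence"
```

Because headers are excluded, a match here means "same sequence", not "same file"; the assets may still differ in how their records are named.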

ieguinoa commented 3 years ago

Thanks for spotting this. v1_0 is the correct version of the genome assembly; v2_1 is just a reannotation (https://phytozome.jgi.doe.gov/pz/portal.html#!info?alias=Org_Alyrata). So v1_0 stays. I'm taking notes to make sure I link to the correct assembly when adding the annotations.

nsheff commented 3 years ago

Here are 2 more duplicates:

Capsella_rubella_JGI_annotation_v1_0_on_assembly_v1 Capsella_rubella__JGI_v1_0

Ostreococcus_lucimarinus_JGI_2_0 Ostreococcus_lucimarinus_JGI_v2_0_assembly_and_annotation

Which should I remove?
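Duplicate pairs like these could be found in bulk by grouping files on a wrapping-independent checksum. This is a rough sketch under several assumptions: `find_dup_fasta` is a hypothetical function name, the FASTA files are uncompressed, end in `.fa`, sit in a single directory, and have no whitespace in their paths, and GNU `md5sum` is available. It is not how the duplicates in this thread were actually found.

```bash
# find_dup_fasta DIR (hypothetical helper): print one line per group of *.fa
# files in DIR whose sequence content is identical, ignoring headers and wrapping.
find_dup_fasta() {
    for f in "$1"/*.fa; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if no file matches
        # Emit "<checksum>  <path>" for each file
        printf '%s  %s\n' \
            "$(grep -v '^>' "$f" | tr -d '\r\n' | md5sum | cut -d ' ' -f 1)" "$f"
    done | sort | awk '
        # Same checksum as the previous line: extend the current group
        $1 == prev { group = group " " $2; n++; next }
        # New checksum: print the finished group if it had duplicates
        { if (n > 1) print group; prev = $1; group = $2; n = 1 }
        END { if (n > 1) print group }'
}
```

Each output line lists the paths that share one sequence checksum; deciding which asset to keep still needs someone who knows the assembly versions.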

ieguinoa commented 3 years ago

Please keep Capsella_rubella__JGI_v1_0 and Ostreococcus_lucimarinus_JGI_2_0

thanks

nsheff commented 3 years ago

Also: Musa_acuminata_Banana_Genome_v1_0 == Musa_acuminata_Genescope-Cirad

ieguinoa commented 3 years ago

Please keep: Musa_acuminata_Banana_Genome_v1_0