legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

Wm82.gnm1.div.Wen_Tan_2014 lacks README #145

Open sammyjava opened 1 year ago

sammyjava commented 1 year ago

This broke the YAML loader because it lacks a README, so I yanked the CHECKSUM file from it, which is why it wound up in datastore-metadata in the first place.

StevenCannon-USDA commented 1 month ago

Summary: I have moved Wm82.gnm1.div.Wen_Tan_2014 to private/Glycine/max/diversity/. It can be added again, but this will be a new curation project, involving contacting the authors and requesting the VCF files (there should be two: one for each population).

Details: v2/Glycine/max/diversity/Wm82.gnm1.div.Wen_Tan_2014/ lacked a README.

In preparing a README under the assumption that the data comes from

   Wen Z, Tan R, Yuan J, Bales C, Du W, Zhang S, Chilvers MI, Schmidt C, Song Q, Cregan PB, Wang D. 
   Genome-wide association mapping of quantitative resistance to sudden death syndrome in soybean. 
   BMC Genomics. 2014 Sep 23;15(1):809. doi: 10.1186/1471-2164-15-809. PMID: 25249039; PMCID: PMC4189206.

I am doubting that the data actually comes from this paper, because the accessions described in Supplementary Table 1 are not present in the VCF. Also, the paper describes two VCFs but only one is in this collection. The number of SNPs looks correct for one of the association studies (52051), but the number of accessions is not, since they should have 392 and 300 accessions, respectively:

zcat glyma.Wm82.gnm1.div.Wen_Tan_2014.SNPs.vcf.gz | grep '#CHROM' | tr '\t' '\n' | awk 'NR>=10' | wc -l
     296
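To check the accession mismatch by name rather than only by count, the sample names in the `#CHROM` header line can be compared against the accessions in Supplementary Table 1. This is a minimal sketch, assuming the accession names have been exported one per line to a text file. The file name `supp_table1_accessions.txt` and the function names are hypothetical and are not part of the collection.

```shell
# List sample names from a gzipped VCF: columns 10 onward of the #CHROM header line.
vcf_samples() {
  gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Print names that appear in both the VCF header and the accession list.
#   $1 = gzipped VCF
#   $2 = text file with one accession name per line (hypothetical export of Supp. Table 1)
shared_accessions() {
  tmp=$(mktemp -d) || return 1
  vcf_samples "$1" | sort -u > "$tmp/vcf.txt"
  sort -u "$2" > "$tmp/paper.txt"
  comm -12 "$tmp/vcf.txt" "$tmp/paper.txt"   # lines common to both sorted lists
  rm -rf "$tmp"
}
```

Usage would look like `shared_accessions glyma.Wm82.gnm1.div.Wen_Tan_2014.SNPs.vcf.gz supp_table1_accessions.txt | wc -l`. A result at or near zero would support the suspicion that the VCF is from a different study, although differences in how names are written (spacing, prefixes) could also hide real matches.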

Clues:

  cat MANIFEST.Wm82.gnm1.div.Wen_Tan_2014.correspondence.yml
  ---
  # filename in this repository: previous names
    - glyma.Wm82.gnm1.div.Z5VC.SNPs.vcf.gz
    - Zenetal2018.vcf.gz

... but this is no help, as the file in v1 simply points to the current directory in v2: Wm82.gnm1.div.Z5VC -> ../../v2/Glycine/max/diversity/Wm82.gnm1.div.Z5VC ... and I have no idea what Zenetal2018.vcf.gz is.

Given the timeframe of the files and the existence of the collection in v1, this was probably data that Annie handled in ~2018 or so.

So, I give up. Punting to @maxglycine and @jd-campbell if you would like to take this on as a new curation project.

maxglycine commented 1 month ago

I give up also. How do you get to private/Glycine/max/diversity/? @StevenCannon-USDA @jd-campbell

StevenCannon-USDA commented 1 month ago

@maxglycine /usr/local/www/data/private/Glycine/max/diversity on soybase-stage or peanutbase-stage

maxglycine commented 1 month ago

I have sent a request to Dechun Wang for the two VCF files for groups P1 and P2. These are two different panels of strains used in the paper. They may have split the strains up by source: Panel 1 included strains that Dechun had in his program, and Panel 2 had a selection of PIs. Panel 1 (P1) was genotyped with the Illumina SoySNP50K and Panel 2 (P2) with the Illumina SoySNP6K iSelect BeadChip. @StevenCannon-USDA @jd-campbell

maxglycine commented 1 month ago

@StevenCannon-USDA @jd-campbell Unless we combine the two VCFs, this paper will have two VCF files. Is the YAML set up for that?

StevenCannon-USDA commented 1 month ago

@maxglycine TLDR: I think a single collection is fine in this case. Longer: The diversity collections aren't tightly specified regarding how to handle multiple populations, but there is no reason currently that the two populations couldn't go into a single collection. If they used different coordinate systems (which they don't), then that would be a reason to split them. For an example of another collection with multiple populations, see https://data.legumeinfo.org/Glycine/max/diversity/Wm82.gnm2.div.Wickland_Battu_2017/

maxglycine commented 1 month ago

A potential problem would be if there are SNP positions that are in both the 50K and 6K chips. If we combined them, then the VCF file would have non-unique positions, and that might make it an invalid VCF file. If the two chips do not share SNPs, then it would be OK to merge them. We would have to re-order the SNPs into chromosomal order, though. @StevenCannon-USDA @jd-campbell
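Whether the two files share positions could be checked directly once both are in hand. This is a minimal sketch, assuming both arrive as gzipped VCFs on the same coordinate system. The file names `P1.vcf.gz` and `P2.vcf.gz` and the function names are placeholders, not actual file names.

```shell
# Print "CHROM<TAB>POS" for every data record in a gzipped VCF, sorted and de-duplicated.
vcf_positions() {
  gzip -dc "$1" | awk -F'\t' '!/^#/ { print $1 "\t" $2 }' | sort -u
}

# Print the positions that occur in both gzipped VCFs ($1 and $2).
shared_positions() {
  tmp=$(mktemp -d) || return 1
  vcf_positions "$1" > "$tmp/a.txt"
  vcf_positions "$2" > "$tmp/b.txt"
  comm -12 "$tmp/a.txt" "$tmp/b.txt"   # lines common to both sorted lists
  rm -rf "$tmp"
}
```

Running `shared_positions P1.vcf.gz P2.vcf.gz | wc -l` would give the number of overlapping positions. Any non-zero count means a simple concatenation would produce repeated positions.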

StevenCannon-USDA commented 1 month ago

I think those chips DO have common SNPs; but it should be fine (preferable I think) to keep them as separate files.