Escobar_Oladzad_2022 GFF is subset of Andean_MidAmer_GBS GFF

sammyjava commented 2 years ago

This may not be an issue, but when working with bean GWAS I hit a situation where I've got two markers per result, because I've got two marker sets with the same markers:

$ zgrep phavu.G19833.gnm2.S06_5422068 */*.gff3.gz
G19833.gnm2.mrk.Andean_MidAmer_GBS/phavu.G19833.gnm2.mrk.Andean_MidAmer_GBS.gff3.gz:phavu.G19833.gnm2.Chr06 Andean_MidAmer_GBS  genetic_marker  5422068 5422068 .   .   .   ID=phavu.G19833.gnm2.S06_5422068;Name=S06_5422068
G19833.gnm2.mrk.Escobar_Oladzad_2022/phavu.G19833.gnm2.mrk.Escobar_Oladzad_2022.gff3.gz:phavu.G19833.gnm2.Chr06 PMID_35106945   genetic_marker  5422068 5422068 .   .   .   ID=phavu.G19833.gnm2.S06_5422068;Name=S06_5422068

Clearly Escobar, Oladzad, et al. used the same markers as Oladzad, et al. to analyze their MAGIC population. Their marker set has 52,206 markers, while Andean_MidAmer_GBS has 355,356, so the smaller set looks like a subset of the larger one.

Do we want to have multiple marker sets with exactly the same markers mapped to the same genomes? This causes the above issue for me, because I simply load the marker name from the GWAS or QTL study (S06_5422068 in this example) and then match up those names with those in the GFF.

I can, and probably should, restrict to markers from the same genotyping_platform, but I just thought I'd throw this up for thought about marker mapping redundancy.
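For illustration, here is a minimal sketch of that restriction. It rebuilds the two GFF records from the zgrep output above in a temporary file, then matches the marker name with and without a filter on the GFF source column (column 2). It assumes the genotyping_platform value is the same string as the GFF source column, which may not hold in general (the Escobar_Oladzad_2022 records use PMID_35106945 as their source). The variable names are hypothetical.

```shell
# Two GFF3 records for the same marker, taken from the zgrep output above.
# Fields are tab-separated; column 2 is the source, column 9 the attributes.
gff=$(mktemp)
printf 'phavu.G19833.gnm2.Chr06\tAndean_MidAmer_GBS\tgenetic_marker\t5422068\t5422068\t.\t.\t.\tID=phavu.G19833.gnm2.S06_5422068;Name=S06_5422068\n' > "$gff"
printf 'phavu.G19833.gnm2.Chr06\tPMID_35106945\tgenetic_marker\t5422068\t5422068\t.\t.\t.\tID=phavu.G19833.gnm2.S06_5422068;Name=S06_5422068\n' >> "$gff"

marker="S06_5422068"            # marker name as loaded from the GWAS/QTL study
platform="Andean_MidAmer_GBS"   # ASSUMPTION: platform name equals GFF column 2

# Matching on name alone returns one record per marker set.
all_count=$(awk -F'\t' -v name="$marker" \
  '$9 ~ ("Name=" name "(;|$)")' "$gff" | wc -l | tr -d ' ')

# Also requiring the source column to match the platform keeps a single record.
hit_count=$(awk -F'\t' -v name="$marker" -v src="$platform" \
  '$2 == src && $9 ~ ("Name=" name "(;|$)")' "$gff" | wc -l | tr -d ' ')

echo "name only: $all_count, name + platform: $hit_count"
rm -f "$gff"
```

This only hides the redundancy on the query side; both marker collections still exist in the datastore.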

My preference would be to update the mixed.gwas.Escobar_Oladzad_2022 README with genotyping_platform: Andean_MidAmer_GBS and delete the G19833.gnm2.mrk.Escobar_Oladzad_2022 genomic marker collection because it adds nothing.

But this seems to be a slightly higher-level question, so I'm posing it. @cann0010

sammyjava commented 2 years ago

What say you, @cann0010 ?

StevenCannon-USDA commented 2 years ago

It looks like Escobar_Oladzad has 7461 markers that aren't in Andean_MidAmer_GBS, and Andean_MidAmer_GBS has 310611 that aren't in Escobar_Oladzad:

# Only in Escobar_Oladzad_2022:
  comm -23 lis.G19833.gnm2.mrk.Escobar_Oladzad_2022 lis.G19833.gnm2.mrk.Andean_MidAmer_GBS | wc -l
    7461

# Only in Andean_MidAmer_GBS:
  comm -13 lis.G19833.gnm2.mrk.Escobar_Oladzad_2022 lis.G19833.gnm2.mrk.Andean_MidAmer_GBS | wc -l
    310611

# In BOTH: 
  comm -12 lis.G19833.gnm2.mrk.Escobar_Oladzad_2022 lis.G19833.gnm2.mrk.Andean_MidAmer_GBS | wc -l
    44740

# Check overlaps another way:
  cat lis* | sort | uniq -c | awk '{print $1}' | sort | uniq -c
    318072 1
    44740 2

So, we would need to create a new set that is the union of both -- either manually, to be updated (problematic), or programmatically. I could imagine a process that creates and updates a "unioned" marker collection whenever new markers are added - so that, effectively, we would have only one platform for a gensp.gnm# ... but this raises new problems.
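As a rough sketch of the programmatic option, the union of the marker-name lists can be built with sort -u and cross-checked against the comm counts. The list contents below are tiny hypothetical stand-ins (only S06_5422068 comes from this thread), and the file names are placeholders. By the counts above, a real union would hold 7461 + 310611 + 44740 = 362812 names.

```shell
# Use the C locale so sort and comm agree on ordering.
export LC_ALL=C

# Hypothetical stand-ins for the two sorted marker-name lists.
workdir=$(mktemp -d)
printf '%s\n' S01_100 S06_5422068 > "$workdir/lis.setA"
printf '%s\n' S02_200 S06_5422068 S09_900 > "$workdir/lis.setB"

# Union: every marker name exactly once, whichever set it came from.
sort -u "$workdir/lis.setA" "$workdir/lis.setB" > "$workdir/union.txt"
union_count=$(wc -l < "$workdir/union.txt" | tr -d ' ')

# Cross-check: union size = only in A + only in B + in both.
only_a=$(comm -23 "$workdir/lis.setA" "$workdir/lis.setB" | wc -l | tr -d ' ')
only_b=$(comm -13 "$workdir/lis.setA" "$workdir/lis.setB" | wc -l | tr -d ' ')
both=$(comm -12 "$workdir/lis.setA" "$workdir/lis.setB" | wc -l | tr -d ' ')

echo "union: $union_count (only A: $only_a, only B: $only_b, both: $both)"
rm -rf "$workdir"
```

This covers names only. A unioned GFF would still need a rule for which source to record for markers present in both sets, and it would have to be rerun whenever a set changes.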

Another case: the marker set in Glycine/max/markers/Wm82.gnm2.mrk.Tran_Steketee_2019/ has 35 markers from the SoySNP50K set, and three markers added as an assay for Soybean Cyst Nematode from a particular genetic background. So, we entered that as a study-specific marker set.

In short: I see the problem, but I don't see a simple solution - at least not on the data side.

sammyjava commented 2 years ago

Sounds good, this "issue" was more food for thought than anything, but I know you like thought food. :)

legumeinfo / datastore-issues

Escobar_Oladzad_2022 GFF is subset of Andean_MidAmer_GBS GFF #125