legumeinfo / jira-issues

placeholder repo for issues migrating from JIRA system, to be moved to their appropriate places later

data issue with bean markers: extra map positions of 0 cM associated #419

Closed adf-ncgr closed 8 years ago

adf-ncgr commented 8 years ago

Noticed this when reviewing Sudhansu's NAPIA slides, in which one marker had a position listed as 0 cM. While not impossible, it seemed odd, since the other map on which it was placed had a more "normal" value. I just got around to querying the db and see that many (though not all) markers seem to have two positions on the same map, one with a mappos value of 0 and the other "normal", e.g.:
drupal=> select * from featurepos;
 featurepos_id | featuremap_id | feature_id | map_feature_id | mappos
---------------+---------------+------------+----------------+--------
          1369 |            53 |    2558235 |        2558235 |      0
          1370 |            53 |    2558235 |        2558235 | 200.46
          1371 |            53 |    2558236 |        2558236 |      0
          1372 |            53 |    2558236 |        2558236 | 156.31
          1373 |            53 |    2558237 |        2558237 |      0
          1374 |            53 |    2558237 |        2558237 | 194.45
          1375 |            53 |    2558238 |        2558238 |      0
          1376 |            53 |    2558238 |        2558238 | 130.35

etc.
Seems likely to be a data-loading bug, but I'm only vaguely familiar with this process, so I will leave it as a conjecture...
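
A grouped query can list the affected features directly instead of scanning the whole table. The following is a minimal sketch, assuming only the featurepos columns shown in the output above. It loads the eight sample rows into an in-memory SQLite database as a stand-in for the real Postgres instance; the helper name `find_zero_and_nonzero` is illustrative and not part of any existing script.

```python
import sqlite3

# Sample rows copied from the featurepos output above.
ROWS = [
    (1369, 53, 2558235, 2558235, 0.0),
    (1370, 53, 2558235, 2558235, 200.46),
    (1371, 53, 2558236, 2558236, 0.0),
    (1372, 53, 2558236, 2558236, 156.31),
    (1373, 53, 2558237, 2558237, 0.0),
    (1374, 53, 2558237, 2558237, 194.45),
    (1375, 53, 2558238, 2558238, 0.0),
    (1376, 53, 2558238, 2558238, 130.35),
]

# Features that have both a 0 and a non-0 mappos on the same map feature.
QUERY = """
SELECT featuremap_id, feature_id, map_feature_id,
       COUNT(*)    AS n_positions,
       MAX(mappos) AS max_mappos
FROM featurepos
GROUP BY featuremap_id, feature_id, map_feature_id
HAVING SUM(CASE WHEN mappos = 0  THEN 1 ELSE 0 END) > 0
   AND SUM(CASE WHEN mappos <> 0 THEN 1 ELSE 0 END) > 0
ORDER BY feature_id
"""


def find_zero_and_nonzero(conn):
    """Return one row per feature with mixed 0 / non-0 positions."""
    return conn.execute(QUERY).fetchall()


# In-memory stand-in for the real database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE featurepos ("
    "featurepos_id INTEGER PRIMARY KEY, featuremap_id INTEGER, "
    "feature_id INTEGER, map_feature_id INTEGER, mappos REAL)"
)
conn.executemany("INSERT INTO featurepos VALUES (?, ?, ?, ?, ?)", ROWS)

suspects = find_zero_and_nonzero(conn)
for row in suspects:
    print(row)
```

The same SELECT should run unchanged in psql, since it uses only standard aggregates and CASE expressions.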

[LEGUME-451] created by adf_ncgr

adf-ncgr commented 8 years ago

The majority of features with both 0 and non-0 featurepos.mappos values are linkage groups.

There are, however, 4 markers with 0 and non-0 featurepos.mappos values:
3248437 - BM152
positions: DOR364_x_BAT477_a-B02: 0, Cerinza_x_G24404_a-B02: 113.9
lg lengths: Cerinza_x_G24404_a-B02: 118.8, DOR364_x_BAT477_a-B02: 92.2
--> might be because Cerinza_x_G24404_a-B02 is reversed relative to DOR364_x_BAT477_a-B02 ... or bad data.
3248502 - BMd28
positions: DOR364_x_BAT477_a-B05: 0, DOR364_x_BAT477_a-B10: 106.4, Cerinza_x_G24404_a-B05: 63.0
lg lengths: Cerinza_x_G24404_a-B05: 63.9, DOR364_x_BAT477_a-B10: 33.6 (??)
--> looks like bad data
3248542 - BM211
positions: Cerinza_x_G24404_a-B08: 0, DOR364_x_BAT477_a-B08: 59.7
lg lengths: Cerinza_x_G24404_a-B08: 18.2, DOR364_x_BAT477_a-B08: 90.4
--> bad data?
3248561 - BM114
positions: DOR364_x_BAT477_a-B09: 0, Cerinza_x_G24404_a-B09: 92.4
lg lengths: Cerinza_x_G24404_a-B09: 129.9, DOR364_x_BAT477_a-B09: 134.7
--> bad data?
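
The "reversed vs. bad data" judgments above can be made a little more mechanical by expressing each position as a fraction of its linkage group length. This is only a rough sketch: the 0.1 tolerance is an arbitrary assumption, the function names are illustrative, and the DOR364_x_BAT477_a-B05 placement of BMd28 is left out because no length is listed for that linkage group.

```python
# (linkage group, position in cM, linkage group length in cM),
# copied from the list above.
PLACEMENTS = {
    "BM152": [("DOR364_x_BAT477_a-B02", 0.0, 92.2),
              ("Cerinza_x_G24404_a-B02", 113.9, 118.8)],
    "BMd28": [("DOR364_x_BAT477_a-B10", 106.4, 33.6),
              ("Cerinza_x_G24404_a-B05", 63.0, 63.9)],
    "BM211": [("Cerinza_x_G24404_a-B08", 0.0, 18.2),
              ("DOR364_x_BAT477_a-B08", 59.7, 90.4)],
    "BM114": [("DOR364_x_BAT477_a-B09", 0.0, 134.7),
              ("Cerinza_x_G24404_a-B09", 92.4, 129.9)],
}

TOLERANCE = 0.1  # hypothetical threshold, as a fraction of LG length


def relative_position(pos, lg_length):
    """Return pos as a fraction of the linkage group, or None if outside it."""
    if pos < 0 or pos > lg_length:
        return None
    return pos / lg_length


def classify(placements, tol=TOLERANCE):
    """Compare the first two placements of a marker."""
    rels = [relative_position(pos, length) for _, pos, length in placements]
    if any(r is None for r in rels):
        # Checked first, so mismatched linkage groups never get compared.
        return "position exceeds linkage group length"
    a, b = rels[0], rels[1]
    if abs(a - b) <= tol:
        return "consistent"
    if abs(a - (1 - b)) <= tol:
        return "consistent if one linkage group is flipped"
    return "inconsistent"


results = {name: classify(p) for name, p in PLACEMENTS.items()}
for name, verdict in results.items():
    print(name, "->", verdict)
```

With these numbers only BM152 fits the flipped-linkage-group explanation, BMd28 is flagged because 106.4 cM lies beyond the 33.6 cM length of B10, and BM211 and BM114 fit neither orientation at this tolerance.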

by ecannon

adf-ncgr commented 8 years ago

I see. I didn't quite understand the modeling of linkage groups as features in this way; I guess we just got unlucky to have chosen BM152 (one of the 4 listed as potentially problematic) for the slides.

One question about the modeling of genetic data, thinking a little bit ahead to what we're learning about the chado->intermine conversion: I can see now that featurepos is being used both to place genetic markers on linkage groups and to place linkage groups within featuremaps and define their boundaries (with respect to themselves as the map feature). On the other hand, it looks like QTLs are being handled as featurelocs, presumably so that fmin and fmax can be used together.

I'm wondering if it would make sense to put everything "genetic" in the same context by modeling QTLs as having 2 featurepos records, as is being done for linkage group boundaries. There's also the featurerange table, which appears to be used for genetic entities defined by flanking features (which I guess is currently being modeled for QTLs as feature_relationships of the QTL to the flanking markers?). Alternatively, if it makes sense to stick with featureloc for QTLs, then maybe we should consider putting marker positions there as well, in order to minimize convoluted logic for range-based queries on genetic maps.

NB: I don't want to cause unnecessary churn by suggesting changes that have no real benefit! But I do think the current situation is a bit confusing (as is typical of chado), and possibly some downstream pain could be avoided by considering some modest changes. (Of course, it's not my code that would be affected, so this is quite easy for me to say!!)
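
To make the range-query point concrete, here is a rough sketch of what "markers inside a QTL" could look like if a QTL were stored as two featurepos rows. Everything here is a simplification for illustration: the `pos_type` column is hypothetical (it stands in for whatever start/stop annotation would actually be used), the rows are made up, and SQLite replaces the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE featurepos (
    featurepos_id  INTEGER PRIMARY KEY,
    featuremap_id  INTEGER,
    feature_id     INTEGER,
    map_feature_id INTEGER,
    mappos         REAL,
    pos_type       TEXT  -- hypothetical: 'start'/'stop' for QTL ends, NULL for markers
);
-- Made-up rows: markers 1-3 and QTL 10 on linkage group 100 of map 1.
INSERT INTO featurepos VALUES
    (1, 1, 1,  100,  5.0, NULL),
    (2, 1, 2,  100, 22.5, NULL),
    (3, 1, 3,  100, 40.0, NULL),
    (4, 1, 10, 100, 20.0, 'start'),
    (5, 1, 10, 100, 30.0, 'stop');
""")

# Markers whose position falls between the QTL's start and stop
# on the same linkage group.
MARKERS_IN_QTL = """
SELECT m.feature_id
FROM featurepos AS m
JOIN featurepos AS qs
  ON qs.map_feature_id = m.map_feature_id AND qs.pos_type = 'start'
JOIN featurepos AS qe
  ON qe.feature_id = qs.feature_id
 AND qe.map_feature_id = qs.map_feature_id
 AND qe.pos_type = 'stop'
WHERE qs.feature_id = ?
  AND m.pos_type IS NULL
  AND m.mappos BETWEEN qs.mappos AND qe.mappos
ORDER BY m.mappos
"""


def markers_in_qtl(conn, qtl_feature_id):
    """Return feature_ids of markers lying inside the QTL interval."""
    return [row[0] for row in conn.execute(MARKERS_IN_QTL, (qtl_feature_id,))]


hits = markers_in_qtl(conn, 10)
print(hits)
```

With everything in one table the query is a self-join on featurepos alone. With QTLs in featureloc and markers in featurepos, the same question needs a join across the two tables plus some way to tie srcfeature_id to map_feature_id, which is the extra logic the comment above is worried about.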

by adf_ncgr

adf-ncgr commented 8 years ago

Marker positions are provided by two spreadsheets,
AsfawBlair2012_PhotosynthateAcquisitionRemobilizationDrought_G3_v07js.xslx and BlairGaleano2012_v22js.xslx.

The two sets of marker positions are completely different, and although the marker names substantially overlap, they are not identical in the two datasets.

by ecannon

adf-ncgr commented 8 years ago

Hmm. Yes, consistency is nice. The QTL were originally saved as featurepos records paired with featureposprop records of type 'start' and 'stop', but were changed to featureloc because that seemed silly. Thinking about consistency of genetic vs. genomic positions, though, it makes sense again. Will revisit when discussions on the new QTL data are revived in April.

by ecannon

adf-ncgr commented 8 years ago

The problems appear to trace to the map DOR364_x_BAT477_a (publication Blair, Galeano et al., 2012). Need to trace whether the map/marker data was incorrectly duplicated in another publication, whether there is corrupt data lying around, or whether there is an error in the loading script.

by ecannon

adf-ncgr commented 8 years ago

This was pretty nasty to track down. The problems (which were more extensive than expected) came down to a map-naming problem. There were two maps named DOR364_x_BAT477_a, in publications Blair, Galeano et al. 2012a, and Asfaw, Blair et al., 2012a. The latter map was renamed to DOR364_x_BAT477_b, but apparently neither publication was reloaded after the change.

Blair, Galeano et al. 2012a has been reloaded and its markers now appear to be correct.

While tracking this problem, I found several markers with 0 and non-0 positions for different maps. These all came down to either near-0 positions or linkage groups that were flipped relative to each other.

There may still be markers placed on different linkage groups. These are likely to be in the primary data but should be examined if found.
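
A pre-load check for the map-naming collision described above could be as simple as grouping map names by publication. This is a minimal sketch, assuming the map name and publication can be read out of each spreadsheet as a pair; the record list below just restates the situation before the rename, and `find_name_collisions` is an illustrative name, not an existing loader function.

```python
from collections import defaultdict

# (map name, publication) pairs as they stood before the rename.
MAP_RECORDS = [
    ("DOR364_x_BAT477_a", "Blair, Galeano et al. 2012a"),
    ("DOR364_x_BAT477_a", "Asfaw, Blair et al., 2012a"),
]


def find_name_collisions(records):
    """Return map names claimed by more than one publication."""
    pubs_by_map = defaultdict(set)
    for map_name, publication in records:
        pubs_by_map[map_name].add(publication)
    return {
        name: sorted(pubs)
        for name, pubs in pubs_by_map.items()
        if len(pubs) > 1
    }


collisions = find_name_collisions(MAP_RECORDS)
for name, pubs in collisions.items():
    print(name, "is used by:", "; ".join(pubs))
```

After renaming the second map to DOR364_x_BAT477_b, the same check returns an empty dict, which would be a cheap signal that both publications are safe to reload.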

by ecannon

adf-ncgr commented 6 years ago

Relates to: GH-459