sfu-ireceptor / dataloading-mongo

Scripts to load data into a MongoDB service repository
GNU Lesser General Public License v3.0

Data loader expects human gene nomenclature #44

Open bussec opened 3 years ago

bussec commented 3 years ago

The data loader (dataload/annotation.py, roughly lines 250-300) assumes that gene calls follow the human gene nomenclature format (e.g., IGHV1-23*04), including an all-caps gene name. Non-compliant calls are silently dropped. This creates problems for mouse datasets, even if they use IMNC nomenclature instead of legacy naming schemes (e.g., Johnston et al.).

bcorrie commented 3 years ago

This is our attempt at making gene calls comparable 8-) We need a mechanism that handles the idiosyncrasies of each annotation tool's gene calling and creates something that is "Interoperable" and "Reusable". We also use an internal mechanism that tries to construct gene names so that we can build allele -> gene -> family relationships. As you know, this has been discussed at length (ad nauseam?)

https://github.com/airr-community/airr-standards/pull/295

Don't get me started 8-)

Our goal here is at a minimum to ensure that any data in any two Turnkey repositories is interoperable and reusable. So we do force this to some degree, as we use the IMGT nomenclature for human genes. This works well for most annotation tools for human data.
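To make the allele -> gene -> family idea concrete, here is a minimal sketch that assumes IMGT-style human names such as IGHV1-23*04. The function name `split_gene_call` is hypothetical and this is not the code in dataload/annotation.py; it only illustrates the kind of relationship being built.

```python
def split_gene_call(call: str) -> dict:
    """Split an IMGT-style call such as 'IGHV1-23*04' into its three levels.

    Assumption: the allele suffix follows '*' and the family is everything
    before the first '-'. Names such as 'IGHV1/OR15-1' are not handled.
    """
    allele = call.strip()
    gene = allele.split("*", 1)[0]    # drop the allele suffix, if any
    family = gene.split("-", 1)[0]    # drop the gene number after the first hyphen
    return {"allele": allele, "gene": gene, "family": family}


if __name__ == "__main__":
    print(split_gene_call("IGHV1-23*04"))
```

A call without a hyphen (e.g. IGHJ4*02) yields the same string for gene and family, which is one reason a real implementation would likely need per-locus rules.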

I admit we don't have a lot of mouse data, so we may need to modify our mechanism for determining valid gene names for mouse (and other species). At the same time, I would stand pretty strongly behind the premise that once data is loaded into an iReceptor Turnkey that the gene names need to be comparable. We can do some of that (we already convert gene names from various annotation tools to a consistent format), but we have to put some onus on the researcher to provide us with a reasonable starting point. Note that the Turnkey will happily load custom fields, so you can still store your original gene names in custom fields, but the v_call/d_call/j_call need to be well defined to start with.

So I think we need some help in determining what that starting point is for mouse gene names - and we can certainly make some changes to load that data more easily for the user. But what is that starting point - and shouldn't that starting point be mentioned as part of the AIRR Spec?

bussec commented 3 years ago

> At the same time, I would stand pretty strongly behind the premise that once data is loaded into an iReceptor Turnkey that the gene names need to be comparable.

I fully agree with that, as this is the idea of the whole standardization exercise ;-) My point is that standardization does not mean that you have to toss species-specific nomenclature out of the window -- as long as this also follows a standard. Mouse needs to be matched with mouse and human with human, but I do not see why both species would need to use ALLCAPS gene symbols. As a permissive sanity check for VDJ genes we use /^(Ig[hkl]|Tr[abdg])[vdj][1-9].*/ for mouse and the all-caps version for human.
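As a rough illustration of that sanity check, the sketch below applies the mouse pattern quoted above and its all-caps counterpart for human. The names `GENE_PATTERNS` and `is_plausible_gene_call` are made up for this example, and the species keys are an assumption; only the two regular expressions come from the description above.

```python
import re

# Permissive per-species patterns: mixed case for mouse, all caps for human.
# The dictionary keys ("mouse", "human") are illustrative assumptions.
GENE_PATTERNS = {
    "mouse": re.compile(r"^(Ig[hkl]|Tr[abdg])[vdj][1-9].*"),
    "human": re.compile(r"^(IG[HKL]|TR[ABDG])[VDJ][1-9].*"),
}


def is_plausible_gene_call(call: str, species: str) -> bool:
    """Return True if the call looks like a V/D/J gene symbol for the species."""
    pattern = GENE_PATTERNS.get(species)
    if pattern is None:
        raise ValueError(f"no gene pattern defined for species {species!r}")
    return pattern.match(call) is not None


if __name__ == "__main__":
    for call, species in [("IGHV1-23*04", "human"), ("Ighv1-53*01", "mouse"),
                          ("IGHV1-23*04", "mouse")]:
        print(call, species, is_plausible_gene_call(call, species))
```

The check is deliberately permissive: it only confirms the locus prefix, segment letter, and a leading non-zero digit, and does not verify that the gene or allele actually exists in a germline set.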

In general I would like to avoid using custom fields, as it will IMO lead to less compatibility in the long run. I think that the solution is a proper germline gene ontology, but that's a discussion for another issue ;-)

bcorrie commented 3 years ago

> At the same time, I would stand pretty strongly behind the premise that once data is loaded into an iReceptor Turnkey that the gene names need to be comparable.

> I fully agree with that, as this is the idea of the whole standardization exercise ;-) My point is that standardization does not mean that you have to toss species-specific nomenclature out of the window -- as long as this also follows a standard. Mouse needs to be matched with mouse and human with human, but I do not see why both species would need to use ALLCAPS gene symbols. As a permissive sanity check for VDJ genes we use /^(Ig[hkl]|Tr[abdg])[vdj][1-9].*/ for mouse and the all-caps version for human.

Yeah, but why, why, why do they need to be different when they could have been the same 8-) I know it's too late, but mixing and matching just makes it harder for everyone... Sour grapes with regard to biologists and standards, I know, but really... 8-)

> In general I would like to avoid using custom fields, as it will IMO lead to less compatibility in the long run.

Yes, didn't mean that you should use them for key fields, but whenever we do a conversion (e.g. for the Adaptive data) we keep custom fields using the original nomenclature so that it is possible to see how the conversion was done - in case you think we messed it up 8-)
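A minimal sketch of that pattern, under stated assumptions: the `_original` field suffix and the function `convert_calls` are hypothetical and do not reflect the Turnkey's actual field names, and the converter is passed in rather than implemented, since the real conversion rules are tool-specific.

```python
from typing import Callable

# Standard gene call fields named in this thread.
CALL_FIELDS = ("v_call", "d_call", "j_call")


def convert_calls(record: dict, convert: Callable[[str], str],
                  suffix: str = "_original") -> dict:
    """Return a copy of record with converted calls and the originals kept.

    For each call field present, the untouched value is stored under
    field + suffix (a hypothetical custom field name) before conversion.
    """
    out = dict(record)  # leave the caller's record unmodified
    for field in CALL_FIELDS:
        value = record.get(field)
        if value is None:
            continue
        out[field + suffix] = value
        out[field] = convert(value)
    return out


if __name__ == "__main__":
    # str.strip stands in for a real, tool-specific converter.
    print(convert_calls({"v_call": " IGHV1-23*04 "}, str.strip))
```

Keeping the original next to the converted value means anyone can audit the conversion later without going back to the source files.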

> I think that the solution is a proper germline gene ontology, but that's a discussion for another issue ;-)

8-)