legumeinfo / jira-issues

placeholder repo for issues migrating from JIRA system, to be moved to their appropriate places later

Remove links between gene builds and chromosomes/scaffolds #694

Open adf-ncgr opened 7 years ago

adf-ncgr commented 7 years ago

There are analysisfeature records linking chromosomes and scaffolds to gene builds. Since gene builds don't define chromosomes, these should be removed from Chado.
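A minimal sketch of what the cleanup could look like. It assumes gene build analyses can be recognised by an `.ann` component in `analysis.name` (as in `araip.K30076.gnm1.ann1`) and that scaffolds are typed `supercontig`. It runs against a toy in-memory SQLite stand-in for the four Chado tables involved, not the real database; the feature names and the helper functions are made up for illustration.

```python
import sqlite3

# Toy stand-in for the Chado tables involved; real Chado has many more columns.
SCHEMA = """
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, name TEXT, type_id INTEGER);
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE analysisfeature (
    analysisfeature_id INTEGER PRIMARY KEY,
    feature_id INTEGER,
    analysis_id INTEGER
);
"""

# Assumption: gene build analyses carry an ".ann<N>" suffix; adjust to local naming.
DELETE_SQL = """
DELETE FROM analysisfeature
WHERE analysis_id IN (SELECT analysis_id FROM analysis WHERE name LIKE '%.ann%')
  AND feature_id IN (
      SELECT f.feature_id
      FROM feature f JOIN cvterm t ON f.type_id = t.cvterm_id
      WHERE t.name IN ('chromosome', 'supercontig')
  )
"""


def build_toy_db():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO cvterm VALUES (?, ?)",
                     [(1, "chromosome"), (2, "supercontig"), (3, "gene")])
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?)",
                     [(1, "chr01", 1), (2, "scaffold_1", 2), (3, "gene_a", 3)])
    conn.executemany("INSERT INTO analysis VALUES (?, ?)",
                     [(1, "araip.K30076.gnm1"), (2, "araip.K30076.gnm1.ann1")])
    # The chromosome and scaffold are linked to both the assembly and the
    # gene build; the gene is linked to the gene build only.
    conn.executemany(
        "INSERT INTO analysisfeature (feature_id, analysis_id) VALUES (?, ?)",
        [(1, 1), (2, 1), (1, 2), (2, 2), (3, 2)])
    return conn


def remove_gene_build_links(conn):
    """Delete chromosome/supercontig links to gene builds; return rows deleted."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(DELETE_SQL)
    return cur.rowcount
```

On the real database it would be prudent to run the equivalent `SELECT count(*)` first and to wrap the `DELETE` in a transaction that can be rolled back.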

[LEGUME-728] created by ecannon

adf-ncgr commented 7 years ago

Note that this may actually be a task for PeanutBase rather than LegumeInfo, though it is probably worth verifying that all species look like what I described for vigun and lupan (see GH-727), since these two are among the most recently loaded species. My very dim memory of cleaning up everything else while I was at it may have been a case of good intentions unfulfilled in reality...

by adf_ncgr

adf-ncgr commented 7 years ago

Looks like a task for both PeanutBase and LegumeInfo, and that it should perhaps be expanded a bit. If one objective is to remove unnecessary analysisfeature records, then gene build analysisfeature records should be removed from features of type exon, mRNA, and polypeptide. For example:

drupal=> select count(*), a.name, t.name from feature f, analysisfeature af, analysis a, cvterm t where f.type_id=t.cvterm_id and f.feature_id=af.feature_id and af.analysis_id=a.analysis_id and a.name like 'vigan.%' group by a.name, t.name;
 count  | name                      | name
--------+---------------------------+-------------
     11 | vigan.Gyeongwon.gnm3      | chromosome
   3376 | vigan.Gyeongwon.gnm3      | supercontig
 160281 | vigan.Gyeongwon.gnm3.ann1 | exon
  26857 | vigan.Gyeongwon.gnm3.ann1 | gene
  36689 | vigan.Gyeongwon.gnm3.ann1 | mRNA
  36689 | vigan.Gyeongwon.gnm3.ann1 | polypeptide

by ecannon

adf-ncgr commented 7 years ago

I'm not sure that those should be removed. One objective in having the links is to make it easy to identify the features added by a particular analysis. This would make it easy to delete feature sets deemed no longer of interest, for example, and would probably also support overview reports of content along the lines of what Connor has been working on. I guess it depends on what we consider necessary, but I see the links as serving more than the gene page.

by adf_ncgr

adf-ncgr commented 7 years ago

I'm okay with leaving the exon, mRNA, and polypeptide links, but note that they are inconsistent:

drupal=> select count(*), a.name, t.name from feature f, analysisfeature af, analysis a, cvterm t where f.type_id=t.cvterm_id and f.feature_id=af.feature_id and af.analysis_id=a.analysis_id and a.name like 'phavu.%' group by a.name, t.name;
 count | name                   | name
-------+------------------------+-------------
    11 | phavu.G19833.gnm1      | chromosome
   697 | phavu.G19833.gnm1      | supercontig
 27197 | phavu.G19833.gnm1.ann1 | gene

drupal=> select count(*), a.name, t.name from feature f, analysisfeature af, analysis a, cvterm t where f.type_id=t.cvterm_id and f.feature_id=af.feature_id and af.analysis_id=a.analysis_id and a.name like 'glyma.%' group by a.name, t.name;
 count | name                 | name
-------+----------------------+-------------
    20 | glyma.Wm82.gnm2      | chromosome
  1170 | glyma.Wm82.gnm2      | supercontig
 56044 | glyma.Wm82.gnm2.ann1 | gene

drupal=> select count(*), a.name, t.name from feature f, analysisfeature af, analysis a, cvterm t where f.type_id=t.cvterm_id and f.feature_id=af.feature_id and af.analysis_id=a.analysis_id and a.name like 'araip.%' group by a.name, t.name;
 count | name                   | name
-------+------------------------+-------------
    10 | araip.K30076.gnm1      | chromosome
  1183 | araip.K30076.gnm1      | supercontig
    10 | araip.K30076.gnm1.ann1 | chromosome
 42533 | araip.K30076.gnm1.ann1 | gene
  1183 | araip.K30076.gnm1.ann1 | supercontig

drupal=> select count(*), a.name, t.name from feature f, analysisfeature af, analysis a, cvterm t where f.type_id=t.cvterm_id and f.feature_id=af.feature_id and af.analysis_id=a.analysis_id and a.name like 'lupan.%' group by a.name, t.name;
 count  | name                      | name
--------+---------------------------+-------------------
 114724 | lupan.Tanjil.a1.0.iprscan | protein_hmm_match
 151126 | lupan.Tanjil.a1.0.iprscan | protein_match
     20 | lupan.Tanjil.gnm1         | chromosome
  13554 | lupan.Tanjil.gnm1         | supercontig
 182583 | lupan.Tanjil.gnm1.ann1    | exon
  33072 | lupan.Tanjil.gnm1.ann1    | gene
  33072 | lupan.Tanjil.gnm1.ann1    | mRNA
  33072 | lupan.Tanjil.gnm1.ann1    | polypeptide
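The inconsistency shown in these listings could also be checked in one pass rather than species by species. The following is a rough sketch, assuming that a gene build is expected to be linked to gene, mRNA, exon, and polypeptide features and that gene build analyses have an `.ann` component in their name. It uses a toy in-memory SQLite database with made-up rows, and `missing_link_types` is a hypothetical helper, not part of Chado or Tripal.

```python
import sqlite3

# Assumption: the feature types a gene build is expected to be linked to.
EXPECTED_TYPES = {"gene", "mRNA", "exon", "polypeptide"}

# Toy stand-in for the Chado tables involved; real Chado has many more columns.
SCHEMA = """
CREATE TABLE cvterm (cvterm_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, type_id INTEGER);
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE analysisfeature (
    analysisfeature_id INTEGER PRIMARY KEY,
    feature_id INTEGER,
    analysis_id INTEGER
);
"""

# Which feature types does each gene build analysis have links to?
LINKED_TYPES_SQL = """
SELECT DISTINCT a.name, t.name
FROM analysisfeature af
JOIN analysis a ON af.analysis_id = a.analysis_id
JOIN feature f ON af.feature_id = f.feature_id
JOIN cvterm t ON f.type_id = t.cvterm_id
WHERE a.name LIKE '%.ann%'
"""


def build_toy_db():
    """Create an in-memory database mimicking the phavu and lupan situations."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    types = ["gene", "mRNA", "exon", "polypeptide"]
    conn.executemany("INSERT INTO cvterm VALUES (?, ?)",
                     list(enumerate(types, start=1)))
    conn.executemany("INSERT INTO analysis VALUES (?, ?)",
                     [(1, "phavu.G19833.gnm1.ann1"), (2, "lupan.Tanjil.gnm1.ann1")])
    # One made-up feature of each type per species: ids 1-4 phavu, 5-8 lupan.
    conn.executemany("INSERT INTO feature VALUES (?, ?)",
                     [(fid, (fid - 1) % 4 + 1) for fid in range(1, 9)])
    # phavu: only the gene is linked; lupan: all four types are linked.
    links = [(1, 1)] + [(fid, 2) for fid in range(5, 9)]
    conn.executemany(
        "INSERT INTO analysisfeature (feature_id, analysis_id) VALUES (?, ?)",
        links)
    return conn


def missing_link_types(conn):
    """Map each gene build analysis to the expected types it has no links for.

    Analyses with no analysisfeature rows at all do not appear in the result.
    """
    linked = {}
    for analysis_name, type_name in conn.execute(LINKED_TYPES_SQL):
        linked.setdefault(analysis_name, set()).add(type_name)
    return {name: sorted(EXPECTED_TYPES - found)
            for name, found in linked.items()
            if EXPECTED_TYPES - found}
```

With the toy rows above, only the phavu gene build is reported, since the lupan gene build is linked to all four types.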

by ecannon

adf-ncgr commented 7 years ago

Good point; I suspect that is due to differences in how the earlier genomes were loaded. The newer ones use the loader itself to establish these analysisfeature linkages, whereas I think most of the older ones got them added after the fact through a manual process. I'll take on the task of making these consistent for LegumeInfo unless you want to...

by adf_ncgr

adf-ncgr commented 7 years ago

It's all yours! I'll tackle peanutbase.

by ecannon