J regia has THREE genomes

bradfordcondon commented 6 years ago

Bug description

https://www.hardwoodgenomics.org/organism/Juglans/regia?tripal_pane=group_reference_genome

not a bug necessarily, but certainly a casualty of #345

almasaeed2010 commented 6 years ago

Looks like we want to delete the old one since it has less info. How do we go about doing that? Is there anything we need to check before deleting an analysis (like whether there are features associated with it)?

bradfordcondon commented 6 years ago

we can archive the old one if it's truly a different assembly version. Before doing so, we should ensure there are no features associated with it.

bradfordcondon commented 6 years ago

similarly red oak has 3 transcriptomes. On the organism page the labels are correct:

https://www.hardwoodgenomics.org/organism/Quercus/rubra?tripal_pane=group_transcriptome

however in the Tripal view:

why aren't the titles [archived]?

if you edit the content, it's [archived].

I should update the instructions to say to include [archived] at the start of a name if it's an older one...

bradfordcondon commented 6 years ago

I think it's clear this analysis is the correct one: https://www.hardwoodgenomics.org/Genome-assembly/2209485?tripal_pane=group_downloads

This one: https://www.hardwoodgenomics.org/Genome-assembly/1963051?tripal_pane=group_downloads has dead links, etc.
Are any features linked to it? If not, we can just delete and/or archive it.

bradfordcondon commented 6 years ago

 select count(*) from chado.analysisfeature where analysis_id = 125;
220340

select count(*) from chado.analysisfeature where analysis_id = 150;
64992

select count(*) from chado.analysisfeature where analysis_id = 191;
0

records: 125, 150, 191. 125 and 150 look almost identical in the analysis table.

 select * from chado_bio_data_21;
mapping_id  entity_id  record_id  nid
1           1962952    17
2           1962953    51
3           1962958    50
4           1963052    151
5           1963053    125
6           1963056    146
7           1963051    150
8           1919890    152
9           1963058    154
10          2161592    157
11          2209433    161
12          2209485    191

Note all three are genome assemblies.

125 ---> https://hardwoodgenomics.org/bio_data/1963053
150 ---> https://hardwoodgenomics.org/bio_data/1963051
191 ---> https://hardwoodgenomics.org/bio_data/2209485

125 and 150 are identical, but 150 is formatted better. Why does 125 have any features at all?

ALL THREE are bioproject 291087. This means that the best and correct course of action is to transfer all features to analysis 191 and delete analyses 125 and 150.

bradfordcondon commented 6 years ago

analysis foreign keys to worry about:

quantification, project_analysis, phylotree, nd_experiment_analysis, analysisprop, analysisfeature, analysis_relationship, analysis_pub, analysis_dbxref, analysis_cvterm.

I think only analysisfeature is relevant.

checking like so: select count(*) from chado.quantification where analysis_id in(125, 150, 191);
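
To run that same count over every table that references analysis in one go, something like the sketch below could work. It is Python against an in-memory SQLite stand-in, not the real database. The tables listed above are written out under what I believe are their full Chado names, and I am assuming analysis_relationship links analyses through subject_id/object_id rather than analysis_id. `count_references` and `demo_connection` are hypothetical helper names.

```python
import sqlite3

# Tables assumed to reference analysis, mapped to the column(s) holding an analysis id.
# These names are an assumption about the Chado schema, not taken from the site.
REFERENCING = {
    "quantification": ["analysis_id"],
    "project_analysis": ["analysis_id"],
    "phylotree": ["analysis_id"],
    "nd_experiment_analysis": ["analysis_id"],
    "analysisprop": ["analysis_id"],
    "analysisfeature": ["analysis_id"],
    "analysis_relationship": ["subject_id", "object_id"],
    "analysis_pub": ["analysis_id"],
    "analysis_dbxref": ["analysis_id"],
    "analysis_cvterm": ["analysis_id"],
}


def count_references(conn, analysis_ids):
    """Return {table: number of rows pointing at any of analysis_ids}."""
    placeholders = ",".join("?" for _ in analysis_ids)
    counts = {}
    for table, columns in REFERENCING.items():
        where = " OR ".join(f"{col} IN ({placeholders})" for col in columns)
        params = list(analysis_ids) * len(columns)
        row = conn.execute(f"SELECT count(*) FROM {table} WHERE {where}", params).fetchone()
        counts[table] = row[0]
    return counts


def demo_connection():
    """Build a tiny stand-in schema with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    for table, columns in REFERENCING.items():
        cols = ", ".join(f"{c} INTEGER" for c in columns)
        conn.execute(f"CREATE TABLE {table} ({cols})")
    conn.executemany(
        "INSERT INTO analysisfeature (analysis_id) VALUES (?)",
        [(125,), (125,), (150,)],
    )
    conn.execute("INSERT INTO analysisprop (analysis_id) VALUES (150)")
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for table, n in count_references(conn, [125, 150, 191]).items():
        print(f"{table}: {n}")
```

Any table that comes back non-zero needs a look before the analyses are deleted. Against the real database the tables would live in the chado schema and the placeholder style depends on the driver, so treat this as a template only.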

bradfordcondon commented 6 years ago

Only analysisprop and analysisfeature have records.

select count(*) from chado.analysisprop where analysis_id in(125, 150, 191);
9

Here are the props:

hardwoods_06112018=> select * from chado.analysisprop ap INNER JOIN chado.cvterm cvt ON cvt.cvterm_id = ap.type_id where analysis_id in(125, 150, 191);
analysisprop_id | analysis_id | type_id | value | rank | cvterm_id | cv_id | name | definition | dbxref_id | is_obsolete | is_relationshiptype
4448 | 125 | 2005 | Juglans_regia_01182017 | 0 | 2005 | 6 | analysis_unigene_name | The name for a unigene. | 2482 | 0 | 0
4449 | 125 | 2006 | | 0 | 2006 | 6 | analysis_unigene_num_contigs | The number of contigs in the unigene assembly | 2483 | 0 | 0
4450 | 125 | 2009 | | 0 | 2009 | 6 | analysis_unigene_num_reads | The number of reads, after filtering, used as input for the assembly | 2486 | 0 | 0
4451 | 125 | 2010 | | 0 | 2010 | 6 | analysis_unigene_avg_length | The average contig length | 2487 | 0 | 0
4452 | 125 | 2008 | | 0 | 2008 | 6 | analysis_unigene_num_clusters | The number of clusters in the unigene assembly | 2485 | 0 | 0
4453 | 125 | 2007 | | 0 | 2007 | 6 | analysis_unigene_num_singlets | The number of singlets remaining in the unigene assembly | 2484 | 0 | 0
4489 | 125 | 2063 | tripal_analysis_unigene | 0 | 2063 | 16 | Analysis Type | The type of analysis was performed. | 2540 | 0 | 0
4525 | 125 | 29 | genome_assembly | 0 | 29 | 15 | analysis_type | The type of analysis was performed. This value is automatically set by each Tripal Analysis module and should be equal to the module name (e.g. tripal_analysis_blast, tripal_analysis_go). | 29 | 0 | 0
4530 | 150 | 29 | genome_assembly | 0 | 29 | 15 | analysis_type | The type of analysis was performed. This value is automatically set by each Tripal Analysis module and should be equal to the module name (e.g. tripal_analysis_blast, tripal_analysis_go). | 29 | 0 | 0
(9 rows)

bradfordcondon commented 6 years ago

Something is very wrong.

analysis 125 has 16852 genes and 16852 mRNAs. analysis 150 has 32496 mRNAs and 32496 polypeptides. 125 also has the supercontigs: 186636 of them.

select analysis_id, type_id, cvt.name, count(type_id) from chado.analysisfeature inner join chado.feature on feature.feature_id = analysisfeature.feature_id INNER JOIN chado.cvterm cvt ON cvt.cvterm_id = feature.type_id where analysis_id in (125, 150, 191) group by cvt.name, analysis_id, type_id;
analysis_id    type_id    name    count
125    215    gene    16852
125    145    mRNA    16852
150    145    mRNA    32496
150    236    polypeptide    32496
125    290    supercontig    186636

bradfordcondon commented 6 years ago

Matt has re-annotated the IPS. The results are in /var/www/html/sites/default/files/IPS_aug_17_2018

What about blast? I'm guessing we didn't do that.

almasaeed2010 commented 6 years ago

Yeah we are gonna need blast too. Since ACF is gonna be down, is it possible to use Staton server? Or is that also going through prepping the new drives?

almasaeed2010 commented 6 years ago

Let's start loading the ips files at least. Did we already delete the old analyses?

bradfordcondon commented 6 years ago

no... we can delete the two older analyses, keep the one with no records, and reload pointing to that one.

bradfordcondon commented 6 years ago

so: delete the two analysis entities below and their chado records:

analysis_id 125 ---> https://hardwoodgenomics.org/bio_data/1963053
analysis_id 150 ---> https://hardwoodgenomics.org/bio_data/1963051

almasaeed2010 commented 6 years ago

it's real easy. delete the two analyses I say to delete in the issue, delete all features

reload, associating with the third, not-deleted analysis

almasaeed2010 commented 6 years ago

Steps:

[x] Delete old analyses (125, 150)
[x] Delete all features
[x] Load mRNA
[x] Load proteins and associate to mRNA
[x] Publish mRNA entities
[x] Load IPR Scans
[x] ~Update genes index~ (not needed ES will update automatically after the importer is done)

almasaeed2010 commented 6 years ago

Oh, I need to somehow find and delete the old mRNA entities, since Tripal doesn't provide an unpublish method yet. We need to figure out a way to identify them now that the actual features are deleted from chado, so there is no longer any association to an organism. I think a simple LIKE query can return those entities, but since we also want to trigger the delete hooks so they are removed from whatever index uses them, I am gonna write a little script to do it.

almasaeed2010 commented 6 years ago

Ok I am adding a new, very simple but super useful feature to tripal alchemist to clear orphaned entities, i.e. entities that have no associated chado records. This feature, if good enough, can then be ported over to the main tripal repo.
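
For reference, the "no associated chado record" test can be expressed as a LEFT JOIN from the entity mapping table to the chado base table. Below is a minimal Python sketch against an in-memory SQLite stand-in; it is not the tripal alchemist code. The mapping table name `chado_bio_data_mrna` is hypothetical (the real ones are numbered, like chado_bio_data_21 above, and the mRNA one is not given in this thread), and `find_orphaned_entities` is a made-up helper name.

```python
import sqlite3


def find_orphaned_entities(conn, mapping_table, base_table="feature", base_key="feature_id"):
    """Return entity ids whose record_id no longer exists in the chado base table."""
    query = f"""
        SELECT m.entity_id
        FROM {mapping_table} AS m
        LEFT JOIN {base_table} AS b ON b.{base_key} = m.record_id
        WHERE b.{base_key} IS NULL
        ORDER BY m.entity_id
    """
    return [row[0] for row in conn.execute(query)]


def demo_connection():
    """Stand-in tables with made-up ids: two features remain, two were deleted."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE feature (feature_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE chado_bio_data_mrna ("
        "mapping_id INTEGER PRIMARY KEY, entity_id INTEGER, record_id INTEGER)"
    )
    conn.executemany("INSERT INTO feature VALUES (?)", [(10,), (11,)])
    conn.executemany(
        "INSERT INTO chado_bio_data_mrna (entity_id, record_id) VALUES (?, ?)",
        [(5001, 10), (5002, 99), (5003, 11), (5004, 100)],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(find_orphaned_entities(conn, "chado_bio_data_mrna"))
```

This only identifies the orphans. The actual removal still has to go through the site's entity delete path so the delete hooks fire and the search index gets cleaned up.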

almasaeed2010 commented 6 years ago

Publishing mRNAs job: https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/443269

almasaeed2010 commented 6 years ago

This issue on tripal_manage_analyses will fix the fields pointing to both deleted analyses: https://github.com/statonlab/tripal_manage_analyses/issues/42

almasaeed2010 commented 6 years ago

So checking the files in /var/www/html/sites/default/files/IPS_aug_17_2018 looks like those are files for the entire site. We only want those for j regia. Have those been added to the server?

almasaeed2010 commented 6 years ago

Ok IPR XMLs are now at: sites/default/files/sequences/englishWalnut01182017/IPR

almasaeed2010 commented 6 years ago

The xml files have the names abbreviated from Juglans_regia_01182017_WALNUT_00003300-RA_mRNA to WALNUT_00003300-RA. I'll run sed on them to add the missing part. This will keep JBrowse happy too.
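
A rough sketch of that rename, written with Python's re module instead of sed so it is easy to check. The prefix and suffix come from the example name above; the assumption that every short id looks like WALNUT_<digits>-R<one uppercase letter>, the `<xref id=...>` sample line, and the `expand_ids` name are mine.

```python
import re

PREFIX = "Juglans_regia_01182017_"
SUFFIX = "_mRNA"

# Match a short id only when it is not already wrapped in the prefix/suffix,
# so running the rename twice does not double up the names.
SHORT_ID = re.compile(
    rf"(?<!{re.escape(PREFIX)})WALNUT_\d+-R[A-Z](?!{re.escape(SUFFIX)})"
)


def expand_ids(text):
    """Rewrite abbreviated transcript ids to the full feature names."""
    return SHORT_ID.sub(lambda m: PREFIX + m.group(0) + SUFFIX, text)


if __name__ == "__main__":
    line = '<xref id="WALNUT_00003300-RA"/>'
    print(expand_ids(line))
```

The lookarounds are there so already-expanded names are left alone, which matters if some of the files were fixed by hand before the bulk run.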

almasaeed2010 commented 6 years ago

IPR job: https://hardwoodgenomics.org/admin/tripal/tripal_jobs/view/443277 (DONE)

almasaeed2010 commented 6 years ago

IPR is done and correctly linked:

So now we are only missing the BLAST for this. Let's create a separate issue for that, with a more relevant title.

The main issue of having 3 analyses is completed!

statonlab / hardwoods_site

J regia has THREE genomes #390
