Attempt to install and run InterProScan

kimrutherford commented 2 months ago

From:

https://github.com/pombase/pombase-chado/issues/52#issuecomment-2331756653

Val:

I mailed InterPro to ask if this is possible, but now I've started worrying that we have a lot of features from InterPro that have incorrect coordinates relative to the current PomBase sequences (because they are based on UniProt), i.e. everything that has any features does have coordinate changes. This will include InterPro domains but also the coiled-coil regions etc. How much hassle would it be, instead of updating with every InterPro release, to run InterProScan locally a couple of times a year? I would rather the features had accurate coordinates, even if they lag a little behind InterPro in content (because there is never too much new stuff for pombe in a release).

Kim:

Last time I tried to install InterProScan it was too hard and I failed. That was a long time ago though and they now provide a helpful Docker image. I'll give it a go. There will be a bit of downstream work because the output format of InterProScan isn't the same as the XML file from InterPro.

kimrutherford commented 2 months ago

InterProScan is running now on oliver1 so apologies if Canto or JaponicusDB are a little slow for the next day (or two/three?).

ValWood commented 2 months ago

Wow, it takes that long, phew!

kimrutherford commented 2 months ago

I vastly overestimated how long it would take. It's just finished. :-)

The documentation said:

InterProScan is a computationally expensive program, sometimes taking a couple of minutes to characterise a single sequence.

so I assumed it would take many hours.

One thing that helps the speed is that if the sequence matches something already in InterPro, it just uses those results rather than re-calculating. That seems to have helped a lot.

The next task is parsing the output. It's a TSV file which is helpful. Here's a sample:

Accession	MD5	Length	Analysis	Signature_accession	Signature_description	Start	Stop	Score	Status	Date	InterPro_annotations_accession	InterPro_annotations_description	GO_annotations	Pathway_annotations
SPAC13G6.15c.1:pep	3e384f1f8cb0c23464a589559bd6892c	163	PANTHER	PTHR10300	CALCIPRESSIN	7	153	6.8E-18	T	26-09-2024	IPR006931	Calcipressin	-	-
SPAC13G6.15c.1:pep	3e384f1f8cb0c23464a589559bd6892c	163	Pfam	PF04847	Calcipressin	6	143	2.4E-14	T	26-09-2024	IPR006931	Calcipressin	-	-
SPBC56F2.10c.1:pep	1348925e1231a87b6524f7ef1369cece	322	Gene3D	G3DSA:3.90.550.10	Spore Coat Polysaccharide Biosynthesis Protein SpsA; Chain A	43	318	1.6E-36	T	26-09-2024	IPR029044	Nucleotide-diphospho-sugar transferases	-	-
SPBC56F2.10c.1:pep	1348925e1231a87b6524f7ef1369cece	322	CDD	cd04188	DPG_synthase	66	287	3.90202E-104	T	26-09-2024	IPR035518	Dolichyl-phosphate beta-glucosyltransferase	-	-
SPBC56F2.10c.1:pep	1348925e1231a87b6524f7ef1369cece	322	SUPERFAMILY	SSF53448	Nucleotide-diphospho-sugar transferases	57	313	9.38E-37	T	26-09-2024	IPR029044	Nucleotide-diphospho-sugar transferases	-	-
SPBC1718.07c.1:pep	de8b11899e677a78f30bed4d3f186adb	404	ProSiteProfiles	PS50103	Zinc finger C3H1-type profile.	326	354	17.253035	T	26-09-2024	IPR000571	Zinc finger, CCCH-type	-	-
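
A minimal sketch of how a TSV file like the sample above could be read, assuming the file starts with the header line shown in the sample and that "-" marks a missing value. The function name `parse_matches` and the dictionary keys are hypothetical, not the actual PomBase parser:

```python
import csv
import io
from collections import defaultdict

# Column names copied from the header line of the sample above.
COLUMNS = [
    "Accession", "MD5", "Length", "Analysis", "Signature_accession",
    "Signature_description", "Start", "Stop", "Score", "Status", "Date",
    "InterPro_annotations_accession", "InterPro_annotations_description",
    "GO_annotations", "Pathway_annotations",
]

# Two rows copied from the sample above.
SAMPLE_TSV = "\n".join([
    "\t".join(COLUMNS),
    "SPAC13G6.15c.1:pep\t3e384f1f8cb0c23464a589559bd6892c\t163\tPANTHER\t"
    "PTHR10300\tCALCIPRESSIN\t7\t153\t6.8E-18\tT\t26-09-2024\tIPR006931\t"
    "Calcipressin\t-\t-",
    "SPAC13G6.15c.1:pep\t3e384f1f8cb0c23464a589559bd6892c\t163\tPfam\t"
    "PF04847\tCalcipressin\t6\t143\t2.4E-14\tT\t26-09-2024\tIPR006931\t"
    "Calcipressin\t-\t-",
])


def none_if_dash(value):
    """Treat "-" as a missing value (assumption based on the sample)."""
    return None if value == "-" else value


def parse_matches(handle):
    """Group the matches in a TSV file by protein accession."""
    matches = defaultdict(list)
    # QUOTE_NONE: descriptions are free text and may contain quote characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        matches[row["Accession"]].append({
            "analysis": row["Analysis"],
            "signature": row["Signature_accession"],
            "description": none_if_dash(row["Signature_description"]),
            "start": int(row["Start"]),
            "stop": int(row["Stop"]),
            "interpro_id": none_if_dash(row["InterPro_annotations_accession"]),
            "interpro_name": none_if_dash(row["InterPro_annotations_description"]),
        })
    return dict(matches)


matches = parse_matches(io.StringIO(SAMPLE_TSV))
for match in matches["SPAC13G6.15c.1:pep"]:
    print(match["analysis"], match["signature"], match["start"], match["stop"])
```

If the file has no header line, passing `fieldnames=COLUMNS` to `csv.DictReader` would give the same row dictionaries.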

kimrutherford commented 1 month ago

I've now had a better look at the results from running InterProScan manually. We mostly get the same domains but the descriptions vary a bit. Apart from that, I think the results from InterProScan will be fine.

The only thing missing is the low complexity regions. Maybe there is another tool we could use for that? I haven't investigated yet.

The predictions for disordered regions will change, possibly by a lot, because it's a different source/method.

As an example for changing descriptions, for SPBC2D10.14c we get this from InterPro XML:

 cd01380     CDD     class V myosin, motor domain
 cd15474     CDD     cargo binding domain of fungal myosin V -like proteins

(that's what's on pombase.org)

but we get this from InterProScan:

 cd01380      CDD     MYSc_Myo5                           
 cd15474      CDD     Myo5p-like_CBD_fungal

So they're the same hits, but with different (less human-readable) descriptions.

In some cases though the description from InterProScan is better. For example (SPBC2D10.14c again), we have this on the website from the InterPro XML:

PR00193   PRINTS      _____      IPR001609    Myosin head, motor domain

From InterProScan we get:

PR00193    PRINTS    Myosin heavy chain signature    IPR001609    Myosin head, motor domain

So in this case we get a description for the PRINTS match where we currently don't have one.

ValWood commented 1 month ago

It seems that, for CDD, what we call "interPro name" is provided as "match name" (which seems to be the more human readable description) https://www.pombase.org/gene/SPBC2D10.14c

Ideally both would be provided. I'll ask InterPro to comment.

ValWood commented 1 month ago

Actually, I got that wrong. There is a short name and a long name (the InterPro description is separate). We want the long (human-readable) name displayed, but it seems that for CDD the short name is provided instead. Is that correct?

kimrutherford commented 1 month ago

As far as I can tell, for each match there's an ID (like "PTHR23065") and a match name (like "PROLINE-SERINE-THREONINE PHOSPHATASE INTERACTING PROTEIN 1") and then mostly there is an InterPro ID and an InterPro name. Occasionally there is no match name and slightly more often there is no InterPro ID and name.

Example: https://www.pombase.org/gene/SPAC20G8.05c

What we see in the Protein families and domains table on the gene pages is more or less the same as the columns in the output of InterProScan.

We want the long (human readable) name display, but it seems that for CDD the short name is provided instead. Is that correct?

Sorry, I don't know what CDD is. I don't know if they provide short and long names.

kimrutherford commented 1 month ago

Here's my first attempt to parse and display the InterProScan output. There is a bit of work to do:

(This is SPBC16A3.15c / nda2 - picked rather at random)

This is what we have currently for the same gene:

(Sorry the tracks aren't in the same order - that needs some work)

ValWood commented 1 month ago

https://desktop.kmr.nz/gene/SPBC2D10.14c

kimrutherford commented 1 month ago

I tried running InterProScan using the option to get the output in JSON format instead of TSV. There is more information in the JSON file, including more detail about the disorder predictions from MobiDB.

So I've re-written the parser to read InterProScan JSON instead of TSV.

That has helped with the MobiDB data. In the TSV file the disordered locations had no disorder type attached so we ended up with this:

After processing the JSON file we can put the different predictions in different tracks:

There are several different prediction types from MobiDB. I don't know if we should show them all. Perhaps we could discuss that on the next call.

Disorder types:

~default~ Consensus
Negative-Polyelectrolyte
Polar
Polyampholyte
Positive-Polyelectrolyte
Proline-rich
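
As a rough illustration of splitting the predictions into one track per disorder type: the sketch below works on plain `(start, end, disorder_type)` tuples rather than on the real InterProScan JSON, and it assumes that a location with no type belongs on the "Consensus" track. The function name `disorder_tracks` and the coordinates are made up for the example:

```python
from collections import defaultdict


def disorder_tracks(locations):
    """Group (start, end, disorder_type) tuples into one track per type.

    Assumption: a location without a type goes on the "Consensus" track.
    """
    tracks = defaultdict(list)
    for start, end, disorder_type in locations:
        track_name = disorder_type or "Consensus"
        tracks[track_name].append((start, end))
    # Sort each track by start position so it can be drawn left to right.
    for regions in tracks.values():
        regions.sort()
    return dict(tracks)


# Hypothetical coordinates, not taken from a real protein.
example_locations = [
    (100, 130, None),
    (1, 40, None),
    (5, 20, "Polar"),
    (25, 38, "Proline-rich"),
]

for name, regions in sorted(disorder_tracks(example_locations).items()):
    print(name, regions)
```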

This gene has examples of some of the less common types:

kimrutherford commented 1 month ago

There is also information in the JSON output about discontinuous features like the SUPERFAMILY feature here: pho84 / SPBC8E4.01c https://www.ebi.ac.uk/interpro/protein/reviewed/O42885/

We should be able to show those sorts of features in the same way. I haven't implemented that yet though.

https://desktop.kmr.nz/gene_protein_features/SPBC8E4.01c
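
One possible way to hold a discontinuous feature is as a single match with several fragments. This is only a sketch: the class name `DiscontinuousFeature`, the accession and the coordinates are hypothetical and are not taken from the InterProScan output or the PomBase code:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DiscontinuousFeature:
    """One match made of several (start, end) fragments, 1-based inclusive."""
    accession: str
    fragments: Tuple[Tuple[int, int], ...]

    def span(self) -> Tuple[int, int]:
        """Overall extent, from the first fragment start to the last end."""
        starts = [start for start, _ in self.fragments]
        ends = [end for _, end in self.fragments]
        return (min(starts), max(ends))

    def gaps(self) -> List[Tuple[int, int]]:
        """Regions between fragments that are not part of the match."""
        ordered = sorted(self.fragments)
        result = []
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
            if next_start > prev_end + 1:
                result.append((prev_end + 1, next_start - 1))
        return result


# Hypothetical accession and coordinates, for illustration only.
feature = DiscontinuousFeature("EXAMPLE0001", ((10, 50), (80, 120)))
print(feature.span(), feature.gaps())
```

A viewer could draw each fragment as a box and each gap as a thin connecting line, so the feature still reads as one match.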

ValWood commented 1 month ago

I think we should show the different types, it could be very useful.

Here https://desktop.kmr.nz/gene/SPBC1604.12 in the InterPro part we do not display a consensus row (but it is in the table). I don't know if you plan to include

kimrutherford commented 1 month ago

we do not display a consensus row (but it is in the table) .

What do you mean by consensus row?

ValWood commented 1 month ago

I mean a "union" row with every disordered region.

Like the MOBIDB-LITE row here, which I guess has everything?

kimrutherford commented 1 month ago

Like the MOBIDB-LITE row here, which I guess has everything?

Sorry, got it now. I'll investigate that. It does show on other gene pages.

kimrutherford commented 1 month ago

Sorry, got it now. I'll investigate that. It does show on other gene pages.

I was imagining that it showed up on other pages. :-)

It's fixed now. You might need to shift-reload.

https://desktop.kmr.nz/gene/SPBC1604.12

kimrutherford commented 1 month ago

I've updated my desktop version so that there is now only one coiled-coils track. I've removed the Pfam coils version. The whole site (on my desktop), including the query builder, now uses the coils features from InterProScan.

The two data sets are mostly similar. There will be differences when the sequence has changed, and in that case the InterProScan results will hopefully be more accurate.

As far as I can tell both datasets were generated with "COILS", which I think is this old, unmaintained software: https://bio.tools/coils

Oddly, there are some prediction differences so maybe the algorithm changed (improved?) at some point. Two examples where the prediction covers more of the residues in the data from InterProScan:

pfd6:

kimrutherford commented 1 month ago

I've now replaced the disordered regions from Pfam with the MobiDB predictions in the query builder. There are some quite big differences, which will mean query results are different.

For example:

Pfam: https://www.pombase.org/gene/SPAC22F8.10c

MobiDB: https://desktop.kmr.nz/gene/SPAC22F8.10c

ValWood commented 1 month ago

I'll explain that in the announcement...

kimrutherford commented 1 month ago

I have now removed the last use of the data downloaded from Pfam a couple of years ago. The final step was to use segmasker to generate low complexity regions for the query builder and the protein feature viewers.

The results are mostly the same except for a few cases where the new regions overlap.

As an example, here's what we used to have for pfl2 / SPAPB15E9.01c: https://www.pombase.org/gene_protein_features/SPAPB15E9.01c

We now have this: https://desktop.kmr.nz/gene_protein_features/SPAPB15E9.01c

That's not great for the query builder. The query for percent of protein covered by low complexity produces the wrong result.

I think the solution is to merge overlapping regions. I'll do that tomorrow.
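
A minimal sketch of the merge, assuming regions are 1-based inclusive `(start, end)` pairs and that adjacent regions (no gap between them) should be joined as well as overlapping ones. The function names and the coordinates are illustrative only:

```python
def merge_regions(regions):
    """Merge overlapping or adjacent (start, end) regions, 1-based inclusive."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous region: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def percent_covered(regions, protein_length):
    """Percentage of the protein covered by at least one region."""
    covered = sum(end - start + 1 for start, end in merge_regions(regions))
    return 100.0 * covered / protein_length


# Hypothetical regions on a 100 residue protein.
regions = [(10, 20), (15, 30), (31, 40), (50, 60)]
print(merge_regions(regions))
print(percent_covered(regions, 100))
```

Summing the lengths of the unmerged regions in this example would count residues 15 to 20 twice, which is the kind of error the percent coverage query runs into.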

ValWood commented 1 month ago

I agree to merge the overlapping features

kimrutherford commented 1 month ago

The current low complexity regions are a bit weird: https://www.pombase.org/gene_protein_features/SPAPB15E9.01c

There is no gap between most of the regions of that gene. I think perhaps Pfam did some post-processing of the segmasker results to prevent overlaps. I think they should have just merged them.

kimrutherford commented 1 month ago

I agree to merge the overlapping features

I've done that and it's much better now:

ValWood commented 1 month ago

nice!

kimrutherford commented 1 month ago

I think we can close this. The only remaining issue is the missing descriptions for the CDD matches. I've asked about that:

ebi-pf-team/interproscan#385

kimrutherford commented 1 month ago

I'm running InterProScan for japonicus now.

The new run_and_process_interpro.sh script will process pombe and japonicus proteins. Documentation: https://github.com/pombase/pombase-chado/wiki/Updating-InterPro

pombase / pombase-chado

Attempt to install and run InterProScan #1218