Closed kimrutherford closed 1 month ago
InterProScan is running now on oliver1 so apologies if Canto or JaponicusDB are a little slow for the next day (or two/three?).
Wow, it takes that long, phew!
I vastly overestimated how long it would take. It's just finished. :-)
The documentation said:
InterProScan is a computationally expensive program, sometimes taking a couple of minutes to characterise a single sequence.
so I assumed it would take many hours.
One thing that helps the speed is that if the sequence matches something already in InterPro, it just uses those results rather than re-calculating. That seems to have helped a lot.
The next task is parsing the output. It's a TSV file which is helpful. Here's a sample:
Accession | MD5 | Length | Analysis | Signature_accession | Signature_description | Start | Stop | Score | Status | Date | InterPro_annotations_accession | InterPro_annotations_description | GO_annotations | Pathway_annotations |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SPAC13G6.15c.1:pep | 3e384f1f8cb0c23464a589559bd6892c | 163 | PANTHER | PTHR10300 | CALCIPRESSIN | 7 | 153 | 6.8E-18 | T | 26-09-2024 | IPR006931 | Calcipressin | - | - |
SPAC13G6.15c.1:pep | 3e384f1f8cb0c23464a589559bd6892c | 163 | Pfam | PF04847 | Calcipressin | 6 | 143 | 2.4E-14 | T | 26-09-2024 | IPR006931 | Calcipressin | - | - |
SPBC56F2.10c.1:pep | 1348925e1231a87b6524f7ef1369cece | 322 | Gene3D | G3DSA:3.90.550.10 | Spore Coat Polysaccharide Biosynthesis Protein SpsA; Chain A | 43 | 318 | 1.6E-36 | T | 26-09-2024 | IPR029044 | Nucleotide-diphospho-sugar transferases | - | - |
SPBC56F2.10c.1:pep | 1348925e1231a87b6524f7ef1369cece | 322 | CDD | cd04188 | DPG_synthase | 66 | 287 | 3.90202E-104 | T | 26-09-2024 | IPR035518 | Dolichyl-phosphate beta-glucosyltransferase | - | - |
SPBC56F2.10c.1:pep | 1348925e1231a87b6524f7ef1369cece | 322 | SUPERFAMILY | SSF53448 | Nucleotide-diphospho-sugar transferases | 57 | 313 | 9.38E-37 | T | 26-09-2024 | IPR029044 | Nucleotide-diphospho-sugar transferases | - | - |
SPBC1718.07c.1:pep | de8b11899e677a78f30bed4d3f186adb | 404 | ProSiteProfiles | PS50103 | Zinc finger C3H1-type profile. | 326 | 354 | 17.253035 | T | 26-09-2024 | IPR000571 | Zinc finger, CCCH-type | - | - |
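A minimal sketch of reading rows like these in Python. It assumes the real file is tab-separated with no header row (the pipes above look like display formatting) and that "-" marks an empty value. The column names are copied from the sample header; the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Column names copied from the header of the sample above.
COLUMNS = [
    "Accession", "MD5", "Length", "Analysis", "Signature_accession",
    "Signature_description", "Start", "Stop", "Score", "Status", "Date",
    "InterPro_annotations_accession", "InterPro_annotations_description",
    "GO_annotations", "Pathway_annotations",
]

def parse_interproscan_tsv(handle):
    """Group matches by protein accession; '-' is treated as a missing value."""
    matches = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        # ASSUMPTION: trailing columns may be absent, so pad short rows.
        row = row + ["-"] * (len(COLUMNS) - len(row))
        record = {k: (None if v == "-" else v) for k, v in zip(COLUMNS, row)}
        for key in ("Length", "Start", "Stop"):
            record[key] = int(record[key])
        matches[record["Accession"]].append(record)
    return dict(matches)

# One row from the sample above, re-typed with tabs.
sample = (
    "SPAC13G6.15c.1:pep\t3e384f1f8cb0c23464a589559bd6892c\t163\tPfam\tPF04847\t"
    "Calcipressin\t6\t143\t2.4E-14\tT\t26-09-2024\tIPR006931\tCalcipressin\t-\t-\n"
)
parsed = parse_interproscan_tsv(io.StringIO(sample))
```

The Score column is left as a string because the samples mix e-values (2.4E-14) and raw scores (17.253035).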
I've now had a better look at the results from running InterProScan manually. We mostly get the same domains but the descriptions vary a bit. Apart from that, I think the results from InterProScan will be fine.
The only thing missing is the low complexity regions. Maybe there is another tool we could use for that? I haven't investigated yet.
The predictions for disordered regions will change, possibly by a lot, because it's a different source/method.
As an example for changing descriptions, for SPBC2D10.14c we get this from InterPro XML:
cd01380 CDD class V myosin, motor domain
cd15474 CDD cargo binding domain of fungal myosin V -like proteins
(that's what's on pombase.org)
but we get this from InterProScan:
cd01380 CDD MYSc_Myo5
cd15474 CDD Myo5p-like_CBD_fungal
So it's the same hits, but with different (less human-readable) descriptions.
In some cases though the description from InterProScan is better. For example (SPBC2D10.14c again), we have this on the website from the InterPro XML:
PR00193 PRINTS _____ IPR001609 Myosin head, motor domain
From InterProScan we get:
PR00193 PRINTS Myosin heavy chain signature IPR001609 Myosin head, motor domain
So in this case we get a description for the PRINTS match where we currently don't have one.
It seems that, for CDD, what we call "interPro name" is provided as "match name" (which seems to be the more human readable description) https://www.pombase.org/gene/SPBC2D10.14c
Ideally both would be provided. I'll ask InterPro to comment.
Actually, I got that wrong. There is a short name and a long name (the InterPro description is separate). We want to display the long (human-readable) name, but it seems that for CDD the short name is provided instead. Is that correct?
As far as I can tell, for each match there's an ID (like "PTHR23065") and a match name (like "PROLINE-SERINE-THREONINE PHOSPHATASE INTERACTING PROTEIN 1") and then mostly there is an InterPro ID and an InterPro name. Occasionally there is no match name and slightly more often there is no InterPro ID and name.
Example: https://www.pombase.org/gene/SPAC20G8.05c
What we see in the Protein families and domains table on the gene pages is more or less the same as the columns in the output of InterProScan.
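The shape of a match described above could be sketched as a small record type. This is only an illustration: the class, field, and method names are hypothetical, and the fallback order in display_name is an assumption rather than what the site does.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DomainMatch:
    match_id: str                        # always present, e.g. "PTHR23065"
    match_name: Optional[str] = None     # occasionally missing
    interpro_id: Optional[str] = None    # missing slightly more often
    interpro_name: Optional[str] = None  # missing together with interpro_id

    def display_name(self) -> str:
        """Hypothetical fallback: match name, then InterPro name, then the ID."""
        return self.match_name or self.interpro_name or self.match_id

# The PRINTS match from the earlier comment, as it looks without a match name.
prints_match = DomainMatch(
    match_id="PR00193",
    interpro_id="IPR001609",
    interpro_name="Myosin head, motor domain",
)
label = prints_match.display_name()
```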
> We want the long (human readable) name display, but it seems that for CDD the short name is provided instead. Is that correct?
Sorry, I don't know what CDD is. I don't know if they provide short and long names.
Here's my first attempt to parse and display the InterProScan output. There is a bit of work to do:
(This is SPBC16A3.15c / nda2 - picked rather at random)
This is what we have currently for the same gene:
(Sorry the tracks aren't in the same order - that needs some work)
I tried running InterProScan using the option to get the output in JSON format instead of TSV. There is more information in the JSON file, including more detail about the disorder predictions from MobiDB.
So I've re-written the parser to read InterProScan JSON instead of TSV.
That has helped with the MobiDB data. In the TSV file the disordered locations had no disorder type attached so we ended up with this:
After processing the JSON file we can put the different predictions in different tracks:
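A rough sketch of splitting one MobiDB match into per-type tracks. The key names ("locations", "start", "end", "sequence-feature") are assumptions about the JSON layout and should be checked against a real output file. The example data and the generic "Disorder" label are made up for illustration.

```python
from collections import defaultdict

def disorder_tracks(match):
    """Split one MobiDB match into per-type tracks of (start, end) pairs."""
    tracks = defaultdict(list)
    for loc in match.get("locations", []):
        # ASSUMPTION: the disorder type is stored under "sequence-feature";
        # locations without one go into a generic "Disorder" track.
        kind = loc.get("sequence-feature") or "Disorder"
        tracks[kind].append((loc["start"], loc["end"]))
    return {kind: sorted(spans) for kind, spans in tracks.items()}

# Made-up example data, not real InterProScan output.
example_match = {
    "locations": [
        {"start": 120, "end": 150, "sequence-feature": "Polar"},
        {"start": 10, "end": 45, "sequence-feature": "Polyampholyte"},
        {"start": 5, "end": 160},
        {"start": 60, "end": 80, "sequence-feature": "Polar"},
    ]
}
tracks = disorder_tracks(example_match)
```

Each key of the returned dictionary would then become one track in the protein feature viewer.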
There are several different prediction types from MobiDB. I don't know if we should show them all. Perhaps we could discuss that on the next call.
Disorder types:
This gene has examples of some of the less common types:
There is also information in the JSON output about discontinuous features like the SUPERFAMILY feature here: pho84 / SPBC8E4.01c https://www.ebi.ac.uk/interpro/protein/reviewed/O42885/
We should be able to show those sorts of features in the same way. I haven't implemented that yet though.
I think we should show the different types, it could be very useful.
Here https://desktop.kmr.nz/gene/SPBC1604.12 in the InterPro part we do not display a consensus row (but it is in the table). I don't know if you plan to include
> we do not display a consensus row (but it is in the table).
What do you mean by consensus row?
I mean a "union" row with every disordered region.
Like the MOBIDB-LITE row here, which I guess has everything?
> Like the MOBIDB-LITE row here, which I guess has everything?
Sorry, got it now. I'll investigate that. It does show on other gene pages.
> Sorry, got it now. I'll investigate that. It does show on other gene pages.
I was imagining that it showed up on other pages. :-)
It's fixed now. You might need to shift-reload.
I've updated my desktop version so that there is now only one coiled-coils track. I've removed the Pfam coils version. The whole site (on my desktop), including the query builder, now uses the coils features from InterProScan.
The two data sets are mostly similar. There will be differences where the sequence has changed, and in those cases the InterProScan results will hopefully be more accurate.
As far as I can tell both datasets were generated with "COILS", which I think is this old, unmaintained software: https://bio.tools/coils
Oddly, there are some prediction differences so maybe the algorithm changed (improved?) at some point. Two examples where the prediction covers more of the residues in the data from InterProScan:
pfd6:
I've now replaced the disordered regions from Pfam with the MobiDB predictions in the query builder. There are some quite big differences, which will mean query results are different.
For example:
I'll explain that in the announcement...
I have now removed the last use of the data downloaded from Pfam a couple of years ago. The final step was to use segmasker to generate low complexity regions for the query builder and the protein feature viewers.
The results are mostly the same except for a few cases where the new regions overlap.
As an example, here's what we used to have for pfl2 / SPAPB15E9.01c: https://www.pombase.org/gene_protein_features/SPAPB15E9.01c
We now have this: https://desktop.kmr.nz/gene_protein_features/SPAPB15E9.01c
That's not great for the query builder. The query for percent of protein covered by low complexity produces the wrong result.
I think the solution is to merge overlapping regions. I'll do that tomorrow.
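A minimal sketch of that merge, assuming regions are 1-based inclusive (start, end) pairs. The function names and example coordinates are hypothetical. It also shows why the coverage query is wrong without merging: overlapping residues get counted twice.

```python
def merge_regions(regions):
    """Merge overlapping 1-based inclusive (start, end) regions."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def percent_covered(regions, protein_length):
    """Percent of the protein covered by the regions, counting each residue once."""
    covered = sum(end - start + 1 for start, end in merge_regions(regions))
    return 100.0 * covered / protein_length

# Made-up coordinates: the first two regions overlap at residues 25-30.
regions = [(10, 30), (25, 40), (60, 70)]
merged = merge_regions(regions)           # [(10, 40), (60, 70)]
coverage = percent_covered(regions, 100)  # 42.0
# Summing the unmerged lengths would give 21 + 16 + 11 = 48 residues instead of 42.
```

Regions that only touch (for example 1-10 and 11-20) are left separate here. Whether those should also be merged is a design choice.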
I agree to merge the overlapping features
The current low complexity regions are a bit weird: https://www.pombase.org/gene_protein_features/SPAPB15E9.01c
There is no gap between most of the regions of that gene. I think perhaps Pfam did some post-processing of the segmasker results to prevent overlaps. I think they should have just merged them.
> I agree to merge the overlapping features
I've done that and it's much better now:
nice!
I think we can close this. The only remaining issue is the missing descriptions for the CDD matches. I've asked about that:
I'm running InterProScan for japonicus now.
The new run_and_process_interpro.sh script will process pombe and japonicus proteins.
Documentation: https://github.com/pombase/pombase-chado/wiki/Updating-InterPro