oceanproteinportal / api

A Python API module for the Ocean Protein Portal
1 stars 0 forks source link

Protein Search taking over 20 seconds #11

Closed kaimikacolin closed 1 year ago

kaimikacolin commented 2 years ago

This query is taking over 20 seconds to return. Is there any way to optimize this?

SELECT DISTINCT ?protein ?product_name ?dataset ?dataset_name ?expedition ?spectral_count ?protein_id ?peptide_count ?sequence ?ids
        WHERE { 
        ?protein a <http://schema.oceanproteinportal.org/v2/views/ProteinFound> .
        ?protein ?p ?o .
        FILTER REGEX(?o, "AA", "i")

        ?protein view:productName ?product_name .
        ?protein view:identifiers ?ids .
        ?protein view:datasetName ?dataset_name .
        ?protein view:dataset ?dataset .
        ?protein view:expedition ?expedition .
        ?protein view:spectralCount ?spectral_count .
        ?protein view:proteinID ?protein_id .
        ?protein view:peptideCount ?peptide_count .
        ?protein view:sequence ?sequence .
        }
        ORDER BY DESC(?spectral_count)
jaclynsaunders commented 2 years ago

@kaimikacolin - It's the FILTER REGEX line. I'll look into whether there's a way to expedite this. Virtuoso is slower than some other engines with regex. tbd

ashepherd commented 2 years ago

specifying the properties you want to search over will speed it up:

SELECT DISTINCT ?protein ?product_name ?dataset ?dataset_name ?expedition ?spectral_count ?protein_id ?peptide_count ?sequence ?ids
        WHERE { 
        VALUES ?p { view:productName view:identifiers view:expedition view:proteinID view:sequence }
        ?protein a <http://schema.oceanproteinportal.org/v2/views/ProteinFound> .
        ?protein ?p ?o .
        FILTER REGEX(?o, "AA", "i")

        ?protein view:productName ?product_name .
        ?protein view:identifiers ?ids .
        ?protein view:datasetName ?dataset_name .
        ?protein view:dataset ?dataset .
        ?protein view:expedition ?expedition .
        ?protein view:spectralCount ?spectral_count .
        ?protein view:proteinID ?protein_id .
        ?protein view:peptideCount ?peptide_count .
        ?protein view:sequence ?sequence .
        }
        ORDER BY DESC(?spectral_count)

view:sequence in the filter will be expensive

jaclynsaunders commented 2 years ago

@kaimikacolin - is this still a lingering issue?

kaimikacolin commented 2 years ago

This trimmed the time down to about 15 seconds, that's great. Let me know if you have any other optimizations!

@ashepherd @jaclynsaunders

jaclynsaunders commented 2 years ago

Low priority -- continued optimization perhaps, but working at an ok level. Maybe try to speed up search when have bandwidth

jaclynsaunders commented 2 years ago

@kaimikacolin - I restructured this string query search. Instead of searching through every single field, it will now only search a regex on the protein product name as well as any of the additional ids (NCBI taxon, pfam, uniprot, etc.). It searches all of the ids, so both the id name as well as the descriptions. Will just need to change the hard coded string "AA" to some input variable.

SELECT DISTINCT ?protein ?product_name ?dataset ?dataset_name ?expedition ?spectral_count ?protein_id ?peptide_count ?sequence ?ids
        WHERE { 
        VALUES ?p { view:productName view:identifiers view:expedition view:proteinID view:sequence }
        ?protein a <http://schema.oceanproteinportal.org/v2/views/ProteinFound> .
        {
        { ?protein view:productName ?product_name .
         FILTER REGEX(?product_name, "AA", "i") }
        UNION
        { ?protein view:identifiers ?ids . 
         FILTER REGEX(?ids, "AA", "i") }
        }
        ?protein view:datasetName ?dataset_name .
        ?protein view:dataset ?dataset .
        ?protein view:expedition ?expedition .
        ?protein view:spectralCount ?spectral_count .
        ?protein view:proteinID ?protein_id .
        ?protein view:peptideCount ?peptide_count .
        ?protein view:sequence ?sequence .
        }
        ORDER BY DESC(?spectral_count)
jaclynsaunders commented 1 year ago

@ashepherd sped up these queries significantly, so I'm closing this issue