Closed kaimikacolin closed 1 year ago
@kaimikacolin - It's the FILTER REGEX line. I'll look into whether there's a way to expedite this. Virtuoso is slower than some other engines with regex. tbd
specifying the properties you want to search over will speed it up:
SELECT DISTINCT ?protein ?product_name ?dataset ?dataset_name ?expedition ?spectral_count ?protein_id ?peptide_count ?sequence ?ids
WHERE {
VALUES ?p { view:productName view:identifiers view:expedition view:proteinID view:sequence }
?protein a <http://schema.oceanproteinportal.org/v2/views/ProteinFound> .
?protein ?p ?o .
FILTER REGEX(?o, "AA", "i")
?protein view:productName ?product_name .
?protein view:identifiers ?ids .
?protein view:datasetName ?dataset_name .
?protein view:dataset ?dataset .
?protein view:expedition ?expedition .
?protein view:spectralCount ?spectral_count .
?protein view:proteinID ?protein_id .
?protein view:peptideCount ?peptide_count .
?protein view:sequence ?sequence .
}
ORDER BY DESC(?spectral_count)
view:sequence
in the filter will be expensive
@kaimikacolin - is this still a lingering issue?
This trimmed the time down to about 15 seconds, that's great. Let me know if you have any other optimizations!
@ashepherd @jaclynsaunders
Low priority -- continued optimization perhaps, but working at an ok level. Maybe try to speed up search when have bandwidth
@kaimikacolin - I restructured this string query search. Instead of searching through every single field, it will now only search a regex on the protein product name as well as any of the additional ids (NCBI taxon, pfam, uniprot, etc.). It searches all of the ids, so both the id name as well as the descriptions. Will just need to change the hard coded string "AA" to some input variable.
SELECT DISTINCT ?protein ?product_name ?dataset ?dataset_name ?expedition ?spectral_count ?protein_id ?peptide_count ?sequence ?ids
WHERE {
VALUES ?p { view:productName view:identifiers view:expedition view:proteinID view:sequence }
?protein a <http://schema.oceanproteinportal.org/v2/views/ProteinFound> .
{
{ ?protein view:productName ?product_name .
FILTER REGEX(?product_name, "AA", "i") }
UNION
{ ?protein view:identifiers ?ids .
FILTER REGEX(?ids, "AA", "i") }
}
?protein view:datasetName ?dataset_name .
?protein view:dataset ?dataset .
?protein view:expedition ?expedition .
?protein view:spectralCount ?spectral_count .
?protein view:proteinID ?protein_id .
?protein view:peptideCount ?peptide_count .
?protein view:sequence ?sequence .
}
ORDER BY DESC(?spectral_count)
@ashepherd sped up these queries significantly, so I'm closing this issue
This query is taking over 20 seconds to return. Is there any way to optimize this?