tripal / tripal_blast

Provides a user interface to BLAST on Tripal sites.
https://tripal.github.io/tripal_blast/

WSOD for large XML file #12

Open laceysanderson opened 7 years ago

laceysanderson commented 7 years ago

There is a known problem: if a user has a large BLAST result XML file (the exact size depends on your web server), they might end up with a memory-exhausted WSOD. This is because the XML reader we use reads the entire file into memory.

We have mostly mitigated this issue by checking the number of hits using grep and then only reading the XML if it is below a static threshold (currently 500). However, there appear to be some edge cases where this still happens, and of course, if your particular server has issues with even 500 hits, this will still be a problem.
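To illustrate the pre-check described above, here is a minimal sketch in Python (the module itself is PHP and uses grep, so this shows only the idea, not the module's code). The names `count_hits`, `should_parse`, and `MAX_HITS` are hypothetical, and the strict "below" comparison is an assumption based on the wording above.

```python
import io

# Threshold quoted in the comment above; where the module stores it is not shown here.
MAX_HITS = 500


def count_hits(lines):
    """Count <Hit> opening tags by scanning text line by line, without parsing XML."""
    # "<Hit>" includes the closing bracket, so tags such as <Hit_num> are not counted.
    return sum(line.count("<Hit>") for line in lines)


def should_parse(lines, max_hits=MAX_HITS):
    """Return True only when the hit count is strictly below the threshold."""
    return count_hits(lines) < max_hits


if __name__ == "__main__":
    sample = io.StringIO("<Hit>\n<Hit_num>1</Hit_num>\n</Hit>\n<Hit>\n</Hit>\n")
    print(count_hits(sample))
```

Because only the hits are counted, a file with few hits but very many HSPs per hit would still pass this check, which may be one source of the edge cases.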

Furthermore, since we simply don't read the XML, we can't even show a subset of hits or a summary, which would be more useful for large result sets than forcing the user to download the TSV and go from there.

In the future I would like to move to a stream-based XML reader (http://php.net/manual/en/class.xmlreader.php) to completely remove this issue.
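As a rough sketch of what stream-based reading looks like, the following uses Python's `xml.etree.ElementTree.iterparse` rather than PHP's XMLReader, so it illustrates the technique only. The function name `iter_hits` and the sample document are hypothetical; the element names (`Hit`, `Hit_id`, `Hit_hsps`, `Hsp`) follow the BLAST XML output format.

```python
import io
import xml.etree.ElementTree as ET


def iter_hits(source, max_hits=None):
    """Yield (hit_id, hsp_count) for each <Hit>, reading the XML incrementally."""
    seen = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Hit":
            continue
        yield elem.findtext("Hit_id"), len(elem.findall("Hit_hsps/Hsp"))
        # Release this hit's subtree; only the emptied element stays attached to its parent.
        elem.clear()
        seen += 1
        if max_hits is not None and seen >= max_hits:
            break  # stop reading; the rest of the file is never parsed


SAMPLE = b"""<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_id>s1</Hit_id><Hit_hsps><Hsp/><Hsp/></Hit_hsps></Hit>
<Hit><Hit_id>s2</Hit_id><Hit_hsps><Hsp/></Hit_hsps></Hit>
<Hit><Hit_id>s3</Hit_id><Hit_hsps><Hsp/></Hit_hsps></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

if __name__ == "__main__":
    for hit_id, hsp_count in iter_hits(io.BytesIO(SAMPLE), max_hits=2):
        print(hit_id, hsp_count)
```

With this shape, a page could show the first few hits or a summary of a large result set instead of nothing at all.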

bradfordcondon commented 6 years ago

Would this result in a (HTTP error code) 500 error?

EDIT (googles white screen of death.... yep, OK, that's our problem)

The user I'm troubleshooting with gets an HTTP 500 error and cannot display the job page if he sets the max hit limit to 250. (Set it to 500 and your failsafe kicks in.)

Setting to 100, and the page loads.

We can check the logs and see that the blast job ran successfully in all cases.

So it sounds like for HWG implementing your module, 250 is somewhere between your fix and what our server can manage. I can change the code on our local branch to cut off at 101 instead of at 500.

ekcannon commented 6 years ago

Possible solutions:

  1. As suggested, use a streaming XML parser like XMLReader. But this requires a PHP package that isn't typically included in a standard PHP installation. The number of results displayed will still need to be limited, and it isn't simply a matter of limiting the number of hits as a single hit can contain many hsps.

  2. Limit the number of hits when the XML is created from the ASN result file but save the full results in the other file formats (HTML, tab, GFF). This can be done with a parameter to blast_formatter. But this only limits the hits; there can still be lots and lots of HSPs for each hit.

  3. Since HSPs can be limited by the BLAST command, add an hsp limit to the configuration options. But this prevents people from searching with repetitive sequence, for example, to look at the distribution of a particular repeat sequence.

  4. Detect if there are too many results after running BLAST and converting ASN into HTML, tab, and GFF outputs. If too many, run BLAST again with hsps limited to: (# hits)/(total # HSPs) and generate the XML from those results. But this greatly slows down the job.

Reactions? Any other ideas?

laceysanderson commented 6 years ago

Hmmmmm.... my gut reaction is that #4 is the safest but I worry since BLAST is a heuristic we would end up with results shown on the page that are not in the downloadable files :-(

ekcannon commented 6 years ago

I have an operational proof-of-concept for option 4, but the concern about getting different results from the second BLAST execution is very valid ... and quite likely to happen. Unfortunately, blast_formatter won't permit limiting HSPs like the BLAST programs do.

laceysanderson commented 6 years ago

What about if we do a combination of #2 & #3? A. create an advanced option for the blast form that limits HSPs and set its default to something reasonable. That way if they need to do repeat analysis they can change it but most people should not. B. When creating the XML from the ASN we limit the number of hits.

A takes care of the HSPs and B takes care of the hits, keeping us in a safe zone and ensuring we have the same results on the page as in the files. There is still a little bit of rope for people to hang themselves, but it is MUCH better than where we currently are. Thoughts?
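A sketch of how the A + B combination might translate into BLAST+ command lines, written in Python purely for illustration. It only builds argument lists and runs nothing. The helper names, file names, and limit values are hypothetical; `-max_hsps` (BLAST programs) and `-max_target_seqs` (blast_formatter) are the BLAST+ options assumed to correspond to A and B.

```python
def blast_command(query, db, archive, max_hsps=None, program="blastn"):
    """Build the search command; output format 11 writes the ASN.1 archive."""
    cmd = [program, "-query", query, "-db", db, "-outfmt", "11", "-out", archive]
    if max_hsps is not None:
        # (A) HSP limit applied at search time, so every output format shares it.
        cmd += ["-max_hsps", str(max_hsps)]
    return cmd


def formatter_command(archive, out, outfmt, max_hits=None):
    """Build a blast_formatter command that converts the archive to one format."""
    cmd = ["blast_formatter", "-archive", archive, "-outfmt", str(outfmt), "-out", out]
    if max_hits is not None:
        # (B) Hit limit applied only where requested, e.g. for the XML (format 5).
        cmd += ["-max_target_seqs", str(max_hits)]
    return cmd


if __name__ == "__main__":
    print(blast_command("query.fasta", "mydb", "job.asn", max_hsps=10))
    print(formatter_command("job.asn", "job.xml", 5, max_hits=500))  # limited XML
    print(formatter_command("job.asn", "job.tsv", 7))                # full tabular file
```

Since every output is formatted from the same archive, the page would show a subset of what is in the downloadable files rather than a different result set.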

ekcannon commented 6 years ago

Hmm. I'll play with this idea for a bit.

laceysanderson commented 6 years ago

Ok, thanks for taking this one on and Yay for coffee and a fresh morning ;-)

smriti-135 commented 4 months ago

Hello. I tried BLAST recently with a huge file and am getting the same problem. A sample is given below.

Query:

CM030740.1 Citrus sinensis isolate HZAU_DHSO_2021 chromosome 1, whole genome shotgun sequence AATTTTACCATTAAACATTTAGTTAATTGGAATATGAAGTTTAGGACCGCCAACTTCATATTGTAATGCT CTAGTTAATAATTAAAATTTTATTAAGTTAGTGATTGTAGATTTCATGTTAACATTAGAAGGGTGGAATT ATTTTAGGTTCAATAATAATGGTTTCATGTTATAGTCAATGATTTTAGGTTAAACATTTAGAATATTTTT AGTTAATGGTCTGTAAATATACGGATGTGATGTTTCGGGTTCAGTGTCGTACTCAAAGTAAGACTGTAAT ATACTTTGTTCTCTAGTATATAGTGATTATACATATATCAATGCATACTAAGTTAAACTCTATTAAAAAA TGCATGGAACAAGCTCGTGGATGGTGCAAGGTTGTATTGCATTAAATTACTTTTCATGTTACATAAATAT AATATTGTAGTCGACGTTATTATCCTATATTATAATAACTAATAAAAATGATTATTCAGTTTCTTATAAT ATAATAAGCTTAAACTATATAACAACATATTGTAATATTTAATTATTTATTAATGGTTTAGTATTACGTT AAAACATTTCGCAAGCCAATAGCCTAATATAACTATTTGTAGGTTTAATTAAAAAATGAATAACAAACCC CGCAAATATTAATGCATCTAAAATAGTATGAATTTAATTTGAACGGTATTATATAAACTATAAATATAGC ATAAAATTTAATTTGTTAAATAATACAGTTCATAATTATTCGAAAACAATGATATATCATAATACAATTC AAGATAATGCAATATAGCCAAAAAAATTATGTTAATAGTCCACAGTGCTTACAATTGAATATGTTGTAAT GTTAATTGTAACTGTCTCATTTTTATTTTTTTAATGATTAAATACAATAACATAAAATTAATATATTTAA AAATGGATATTATATTTCAATAATGATTTCAAATCTATTTGTTACTATGAAAATAATTAATGGACTATGA ATTGTAGGAATTGAATTACATAAAATATAGTTACTCTAATTACTATCTTATTCTAAAAAACAAAATTAGA ATTATTTTTCTTTAAAAAATGCTATAACTTTAAAATCCAGCTTATTTGCCAAAATCTTCCCATTTTTTGG TTCGCCCTCCACAAAAGGGCTTCAAGCCTGTTGATTAGGTTGCCCAACATTCTAAACAAAACAATTCAGC CGTTGTTCCATCCATTATCACTTCTGGACATATTAAATATCATCCATTAGATAAGATTTTAACATATTCA GTGGTTGAGATTTTGTTGACTGTAAGTTACATTTAACGTAACTTAGGGATTGCCTACAAAAATGAAGCTA

Target:

scaffold00001 length=5927163 TTTTGTATTCTATGTCCTCTGATCTTTATACTTCTTCATTTTGTCTTTGCAAGAACCGGA ATTATGGGTACATCACAAATTCTCTAGGTGTGACTTGTGTTGTGGGGCCTTTTTTTtACA TTTCCATATTGCAAGTATTTTTTTGCTACCATTGGTATATTTGTCTGTTAAAATCAATCT GCTTTCACTTATGTTCGTGCGTTCTTGTTCCCTCGCCTTGCAATTGCATATCTCAAATTA TCTTTCTTACTTTGATTTAGATGGCCAAGGTTTTAAGCTAACTTTTTACAATGCCAATTT TTAAATGGTTTTCTAATGCTGTTCAAAGTTGCAGCCTTTACTTCGTATATTTGTCAGGTT CTGACGGGTGCGGTCGGCGGCGGGGGCTATAGCATGCGGTCTCGAGAGCCGCAAAGAAAA ATGGGTGGTTTTCCCGGTTTCGGCCATAACTCGTGATCGGGGCCTCCGATTCTGGTTCCG TTTCGTCCCACGGGACCAGCCGGGCGGGGGCATCGGATTGCAAAAGTCTTTAAATTTGAA TTTGATTTAAGTTTATATAGTTTGAACACAAAAACTAGCCATTACGGACAAAAACAACAA ATAGTCGGCTAGCCTATTAATTAGCCAGATCGCCTCTTAATACAGTGCAAGTTACCGTTG CAATTTGAATTTTGCTGCAGTGATGCTATAGTAACACTATTTTTtAAAATTTCATTGTTA CCTAAAACTTTTTTATAATTTGACTATGACCCAAAATGTCATAAAATTTTGCAAATATAT CAAATTTCAGAATTTCTAAATAATGCGCGTTATTCTTAAAACTTTTTGAAATTATGCTAT GGCCTAAAACTTTATAATATTTTTCAAAGAGATTCTTCTCAGAATTTTAACATAATGCTT ATTTATTTCAAAGTTCCCAAAATCTTTTTCAGTTTAATCCAAACTTTGAAAAACACTCAA ATCCTCAAAATACTCGTCTTATAAATATAAAATCTTTTTGTTTATAAAAAGTAATGATTT ATTAAATAAAATCTTGAGCTTTTTCAATGCTAAACTATACATATATCAAATCATACTGGC TTTATAAGAATTTGTTGCAATAATGACTCCGCAGAGCTAAACTTTGCTCTTGATCAAGCC

I am either getting a memory error such as

Fatal error: Allowed memory size of 2147483648 bytes exhausted (tried to allocate 324671885 bytes) in /var/www/html/teak-wood-genes-drupal7-test/includes/database/database.inc on line 2284

Fatal error: Allowed memory size of 2147483648 bytes exhausted (tried to allocate 23072768 bytes) in Unknown on line 0

Warning: Unknown: Cannot call session save handler in a recursive manner in Unknown on line 0

Fatal error: Allowed memory size of 2147483648 bytes exhausted (tried to allocate 98570240 bytes) in /var/www/html/teak-wood-genes-drupal7-test/includes/bootstrap.inc on line 3876

or a "gatekeeping timeout" error. But the files are getting generated and stored... I was asked to contact the developers/maintainers to see what can be done. I am using Tripal v3.100.

Edit 1: I tried BLAST on the BLAST website using the same files and got this error: "Length limit exceeded. Please reduce your query/subject sequence length to 10,000,000 letters or less." So is my file too large? I still need to use that file, so how do I configure things to allow it?

Edit 2: When I tried blasting in the web UI with a very small query, I got some results, but I also got more errors (hundreds, which are basically repetitions of the same few):

Deprecated function: imagefilledpolygon(): Using the $num_points parameter is deprecated in generate_blast_hit_image() (line 654 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).

Deprecated function: imagefilledpolygon(): Using the $num_points parameter is deprecated in generate_blast_hit_image() (line 654 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).

Deprecated function: Implicit conversion from float 299.4699930400503 to int loses precision in generate_blast_hit_image() (line 671 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).

Deprecated function: Implicit conversion from float 50.219298245614034 to int loses precision in generate_blast_hit_image() (line 672 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).

Deprecated function: Implicit conversion from float 50.219298245614034 to int loses precision in generate_blast_hit_image() (line 696 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/api/blast_ui.api.inc).

Warning: A non-numeric value encountered in include() (line 92 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/theme/blast_report_alignment_row.tpl.php).

Warning: A non-numeric value encountered in include() (line 106 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/theme/blast_report_alignment_row.tpl.php).

Warning: A non-numeric value encountered in include() (line 111 of /var/www/html/teak-wood-genes-drupal7-test/sites/all/modules/tripal_blast/theme/blast_report_alignment_row.tpl.php).