hacioguz / solr-php-client

Automatically exported from code.google.com/p/solr-php-client
Other
0 stars 0 forks source link

Feature request: avoid using file_get_contents() #49

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Solr PHP Client uses file_get_contents(), which requires the allow_url_fopen to 
be enabled; while the use here seems to be quite safe, it is often recommended 
that the feature be entirely disabled so that there is no way the system can be 
tricked into using include() or require() to load content from a foreign URL.

If you would consider including a patch that changes the system to use curl if 
the extension is enabled, and fall back to the current behaviour if it is not, 
then I am happy to write the patch for you; I just want to know that you're 
happy to include the patch in principle (subject to whatever changes you 
require).

Original issue reported on code.google.com by liam.obo...@gmail.com on 9 Sep 2010 at 1:25

GoogleCodeExporter commented 9 years ago
This has been discussed somewhat before, I believe on the google group. I'm not 
against a patch like that, but I might not immediately integrate it. To do this 
right, the Service and Response classes really need to be refactored so that 
the way in which requests are sent to Solr is pluggable. There'd be either an 
interface or abstract class that'd define the behavior, but the implementation 
would be free to use whatever means deemed appropriate: file_get_contents, 
curl, Zend_Http, etc.

Feel free to submit any patch, it only helps even if its not immediately 
integrated - and if you take a shot at something like I laid out then that's 
even more awesome :)

Original comment by donovan....@gmail.com on 9 Sep 2010 at 2:15

GoogleCodeExporter commented 9 years ago
Sounds workable; I'll try to get onto something like you suggested some time 
soon.

Original comment by liam.obo...@gmail.com on 16 Sep 2010 at 12:16

GoogleCodeExporter commented 9 years ago
This suddenly sprang into urgency this weekend when I needed to rebuild a 
search index with a few GB of documents in it; it appears that there is a 
rather significant memory leak in the php stream context code, such that each 
time you assign the contents to the http request that memory is not freed until 
the process dies.  I tried unsetting the options and the context, but it didn't 
free the memory.

This is probably not a problem for an individual indexing request, but with a 
daemon or a bulk indexing job it's a critical problem that makes the current 
approach unusable.  As such, I've rolled an emergency patch so that I could 
rebuild our index, but it purely replaces the current approach with curl, 
through changes in the Service and Response classes.  If anyone else comes 
across the need for this, I can add a patch here.

Hopefully I'll get on to a proper solution along the lines which Donovan has 
suggested above later this week.

The following bugs refer to leaks with stream_context_create, but I suspect 
it's a similar problem with the leak I have found.
* http://bugs.php.net/bug.php?id=50111
* http://bugs.php.net/bug.php?id=40257

Donovan, I was thinking that my implementation would be something like
* Create a factory class which the Service will use to create a Request object 
(which submits requests to the service) and Response objects (which will 
process the response).
* The factory can be injected into the Service; if none is set, it will 
instantiate a default factory object which will create Requests and Responses 
that do the current behaviour; this way the system will continue to behave as 
it already does for backwards compatibility.
* I will also implement a factory which uses Curl instead, which will create 
Requests and Responses that use Curl instead and handle the Curl format 
response headers appropriately.

Original comment by liam.obo...@gmail.com on 3 Oct 2010 at 4:19

GoogleCodeExporter commented 9 years ago
What revision of the code are you using? In issue #20 / r21 I changed the code 
to not create a new context for every request and instead to reuse 1 context 
for get and 1 context for post. Are you using a single Apache_Solr_Service 
instance or many throughout the indexing process? 

Also, Timo Schmidt has given me a patch by email for switching out http 
backends, which is why I created the http_requests branch to integrate it 
without effecting trunk until I was satisfied with it. I'll get those changes 
in asap so you don't have to duplicate effort there.

Original comment by donovan....@gmail.com on 3 Oct 2010 at 5:05

GoogleCodeExporter commented 9 years ago
I am using the one that only uses a single context, but the way that I am 
calling it is inefficient and creates a new service for each request (a problem 
I do need to address).  I was trying to unset the option which has the document 
content in it without success, but trying again today I am able to do so; the 
problem was that I was losing an amount of memory equal to the document size 
each request, not just the size of the stream context, so things went down hill 
very quickly!

Anyhow, it is possible to unset the data if you set the option correctly, so 
this is not so dire as I thought :) I'm still keen to roll the curl patch; 
should I make it against the http_requests branch instead then?

Original comment by liam.obo...@gmail.com on 3 Oct 2010 at 11:51

GoogleCodeExporter commented 9 years ago
Hmm, running a diff between that branch and the trunk shows no differences, 
e.g. 

svn diff http://solr-php-client.googlecode.com/svn/trunk/ 
http://solr-php-client.googlecode.com/svn/branches/http_requests

returns nothing.

Original comment by liam.obo...@gmail.com on 3 Oct 2010 at 11:43

GoogleCodeExporter commented 9 years ago
I have not committed the differences yet. I will let you know here when I do.

Original comment by donovan....@gmail.com on 4 Oct 2010 at 2:50

GoogleCodeExporter commented 9 years ago
Here it is: r40  

Original comment by donovan....@gmail.com on 8 Oct 2010 at 1:46

GoogleCodeExporter commented 9 years ago
The http_requests branch has been fully merged to trunk in r49. Thanks everyone.

Original comment by donovan....@gmail.com on 21 Oct 2010 at 3:26