rudolphi / open_enventory

PHP/MySQL-based chemical inventory/Electronic Lab Notebook for chemistry
https://sourceforge.net/projects/enventory/
GNU Affero General Public License v3.0

ChemExper Blocking #13

Open lcnittl opened 5 years ago

lcnittl commented 5 years ago

Is it possible that ChemExper blocks an IP address that sends too many requests? We were running an inventory Batch processing with Read data from suppliers. The first few entries go through fine, then requests sent to ChemExper time out. To check whether ChemExper was down, we ran cURL from another host and reached it without problems. After waiting some hours, the block seems to be lifted.

I guess there is no possible workaround for this? And did I deduce correctly that structures are fetched from ChemExper? (At least no structures were generated when we deactivated the use of ChemExper by setting $GLOBALS["suppliers"]["acros"]["alwaysProcDetail"] to false.)

https://github.com/rudolphi/open_enventory/blob/61983563e7c916f00db197ecc06a058d47fa4241/suppliers/Acros.php#L29-L38

lcnittl commented 5 years ago

A short update: Some structures are still found without ChemExper; it seems to be a coincidence that the first 150 were not :) However, we also realized that ChemExper is still being consulted (and still timing out).

rudolphi commented 5 years ago

When reading log files, I often had the impression that ChemExper blocks if repeated access is detected. However, I hoped that the other suppliers would still be enough in such a case.

MOLfiles can be loaded from Acros, Cactus, chemicalbook, Fluorochem, NIST and Pubchem, so there are multiple sources.

MSDS can be loaded from Acros, Activate, Alfa, Apollo, Biosolve, carbolution, Carl Roth, Cayman, Fisher, Fluorochem, ITW/Applichem, Merck, Oakwood and Strem. The blockings are a bit sad as Acros has many substances, good quality data, MOLfiles and MSDS...

In a different case of IP address blockings, proxy services like http://anonymouse.org/cgi-bin/anon-www_de.cgi/http://sciformation.com may help, but I am not sure if we should get into this.

lcnittl commented 5 years ago

Thanks for your answer! I got the same impression: the first few requests go fine, then the blocking starts.

Indeed, they are sufficient for most of the molecules. A downside, however, is the long time it takes when being blocked, as several hundred (or thousand) timeouts do add up.

I tried to deactivate Acros in the Internet data retrieval tab of the global settings, but seemingly without success. Is there an easy way to deactivate Acros temporarily (for example, by removing the PHP file)?

As a side question: We just see the following suppliers in our global settings (the files are in place):

Is there a setting we are overlooking to also have the others in the list?

khoivan88 commented 5 years ago

I had the same experience of being temporarily blocked by ChemExper after several attempts. As Felix said, there are also many other sources for structures and SDS. @lcnittl : the deactivation inside OE in Global settings only removes a supplier from being accessed during Search Chemical in Supplier mode. I believe it does not stop OE from accessing those suppliers during import from a tab-separated text file, as in your case. For this issue, I think there are two things you can do:

  1. Reduce the set_time_limit in import.php on line ~323 (the line number may be off if the file has been modified; the relevant snippet is linked below). This reduces the wait time for non-responding suppliers. In my experience, a working supplier takes less than 30 s; I used to set this value to 60.

https://github.com/rudolphi/open_enventory/blob/61983563e7c916f00db197ecc06a058d47fa4241/import.php#L309-L325

  2. In lib_supplier_scraping.php, in the function getAddInfo(), you can: a) Shorten the set_time_limit on line ~168 (again, I am not sure whether this is redundant). b) Right before the foreach statement that follows the set_time_limit on line ~168, add something like this:
    // Khoi: removing Sigma and Acros because the scraping scripts for these 2 sites do not work and just take time
    unset($addInfo[1]);  // removing Acros
    // unset($addInfo[4]);  // removing Sigma; update 2019-07-26, Sigma search is working on A2hosting server now
    // unset($addInfo[6]);  // removing chemicalBook

The index x in $addInfo[x] corresponds to a supplier; the mapping can be found in the same file, lib_supplier_scraping.php, on line ~88. The index starts from 0.

This has worked for me but @rudolphi can tell you the best way.
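The hard-coded unset above removes a supplier for every run, even when it is reachable. A possible alternative is to stop asking a supplier only after it has timed out several times in a row during the current batch. The sketch below shows that idea in Python for brevity; it is not OE code. The class `SupplierSkipper`, its `max_failures` parameter, and the `fetcher` callables are hypothetical names made up for this illustration.

```python
from typing import Callable, Dict, Optional


class SupplierSkipper:
    """Skip a supplier once it has failed too many times in a row."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        # Consecutive timeout count per supplier name.
        self.failures: Dict[str, int] = {}

    def is_skipped(self, supplier: str) -> bool:
        return self.failures.get(supplier, 0) >= self.max_failures

    def fetch(
        self, supplier: str, fetcher: Callable[[str], str], cas: str
    ) -> Optional[str]:
        """Call fetcher(cas) unless the supplier is currently skipped.

        Returns None when the supplier is skipped or the call times out.
        """
        if self.is_skipped(supplier):
            return None  # no request is sent, so no time is lost waiting
        try:
            result = fetcher(cas)
        except TimeoutError:
            # Assumption: the fetcher signals a timeout with TimeoutError.
            # A real HTTP library may raise its own exception type instead.
            self.failures[supplier] = self.failures.get(supplier, 0) + 1
            return None
        # A successful call resets the counter for this supplier.
        self.failures[supplier] = 0
        return result
```

With `max_failures=3`, a supplier that times out on every request is contacted three times and then left alone for the rest of the batch, so thousands of entries no longer mean thousands of timeouts.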

khoivan88 commented 5 years ago

@lcnittl : I also wrote a couple of Python scripts to scrape structures and SDS from the internet and add the info into OE. They basically look into the OE database of interest, find the molecules (by CAS#) with a missing structure or SDS, and then scrape that info from the internet. You would need Python on your hosting server and root access on that server. If you are interested, please let me know and I can share those scripts with you.
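The flow described above (find entries without a structure, then try several sources for each) could be sketched roughly as follows. This is not the actual script: it uses an in-memory-capable `sqlite3` connection as a stand-in for the OE MySQL database, and the table name `molecule`, the columns `cas_nr` and `molfile`, and the `sources` callables are assumptions chosen only for illustration.

```python
import sqlite3
from typing import Callable, List, Optional, Sequence

# A source takes a CAS number and returns MOLfile text, or None if not found.
Source = Callable[[str], Optional[str]]


def find_missing_structures(conn: sqlite3.Connection) -> List[str]:
    """Return CAS numbers of rows that have no structure stored yet."""
    rows = conn.execute(
        "SELECT cas_nr FROM molecule "
        "WHERE cas_nr IS NOT NULL AND cas_nr <> '' "
        "AND (molfile IS NULL OR molfile = '') "
        "ORDER BY cas_nr"
    ).fetchall()
    return [row[0] for row in rows]


def first_hit(cas: str, sources: Sequence[Source]) -> Optional[str]:
    """Try each source in order; return the first non-empty result."""
    for source in sources:
        try:
            molfile = source(cas)
        except Exception:
            # A blocked or broken source should not stop the whole run.
            continue
        if molfile:
            return molfile
    return None


def fill_missing(conn: sqlite3.Connection, sources: Sequence[Source]) -> int:
    """Store a structure for every row where one can be found.

    Returns the number of rows that were updated.
    """
    updated = 0
    for cas in find_missing_structures(conn):
        molfile = first_hit(cas, sources)
        if molfile is not None:
            conn.execute(
                "UPDATE molecule SET molfile = ? WHERE cas_nr = ?",
                (molfile, cas),
            )
            updated += 1
    conn.commit()
    return updated
```

Because `first_hit` moves on when a source raises, a supplier that is currently blocking requests only costs its own timeout and the next source is tried.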

lcnittl commented 5 years ago

@khoivan88 Thanks for your input. I think I will indeed go with option 2b.

Concerning the python scripts: If you are willing to give them away I would certainly not say no :)

khoivan88 commented 5 years ago

@lcnittl : Here is the link to my Python script to search for missing structures. You can install the required packages listed in requirements.txt. You need to change to the root user on your server first by running su in the terminal. You can then use the Python file inside update_sql_mol_v6 by running something like python3 update_sql_mol.py in the terminal. It will ask whether you are using the root user (answer y) and then ask for the name of the database you want to affect. You will have to type in the name of the database twice (I designed it that way to make sure that the user is sure of what they want to do).

After the program is done, you won't see the structures yet. You will have to log in to OE on the webpage as the root user, go to Settings/Batch Processing, choose the database that you just ran the Python script on, check all of the following: "MOLECULE", "EMPIRICAL FORMULA", "MW", "DEG. OF UNSAT.", "STRUCTURE", and "SMILES", and then let OE run to generate the structure images. (I wrote this script a while ago, and at that time I did not know how to include the generation of structure images in OE. I have an idea now of how to incorporate it into the Python script, but I have not had time to go back and add it.) Sorry for the inconvenience. See the update note below. Let me know if you have any issues and I can walk you through it in more detail.

https://github.com/khoivan88/update_sql_mol

I have another script to update SDS, but I have not uploaded it to GitHub yet. I will do that and then give you the link later.

Update (2020-01-18): the newest version of this script should work without the extra manual Batch Processing step. I have updated the instructions in the repo as well.

khoivan88 commented 5 years ago

@lcnittl : This is the link to the script for updating missing SDS. It runs very similarly to the Python script for updating MOLfiles. However, you just need to run this script and you are done; no second step is required. As usual, if there is any problem, please let me know. https://github.com/khoivan88/find_missing_sds-public

PS: I forgot to say that both Python scripts are made for OE hosted on Linux (specifically CentOS 7). If you host it on a different system such as Mac or Windows, you might want to change the download_path variable in both files to some other location on your system!

lcnittl commented 5 years ago

@khoivan88 Thanks for the scripts - they are very much appreciated. I will have a look at them in the next few days.

For the OS - no problem, we are running on Debian (containerized, so I will still have a look) :)