rafsanlab / ScrapPaper

A web scraping method to extract journal information from PubMed and Google Scholar using Python.
Mozilla Public License 2.0

Small Feature Request: Ability to define the output file name from the keyword searches used. #2

Closed mindwellsolutions closed 1 year ago

mindwellsolutions commented 1 year ago

Thank you for developing such a great Python script. This truly stands out among the other 10 I tried - finally a script that actually searches for direct keywords within the search results.

There are a couple of small things I ran into that, if addressed, would make the script considerably more powerful. Currently, it's nearly impossible to finish a run when the script is parsing a search with more than 49 pages of results, because a CAPTCHA interrupts it. And if I notice that the result pages are starting to contain irrelevant studies and abort, the CSV output is broken.


1) A simple solution: a command-line parameter that pre-defines how many search pages to parse. For example, to parse only the first 25 search pages (-s = search pages): python scrappaper.py -s 25. (A rough sketch covering requests 1-3 follows this list.)

2) Allow the script to accept a URL that starts on, for example, the 50th results page, so I can switch my VPN IP and continue from the point where it stopped due to a CAPTCHA. Currently it restarts from the first results page even if the URL points to the 50th.

3) Have the script automatically name the output CSV file from the keywords used in the Google Scholar/PubMed search, plus a date and time stamp. That way it's clear which CSV file holds which results, and the script doesn't keep overwriting the same output file.
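Something like the following argparse sketch could cover requests 1-3. This is a rough sketch, not ScrapPaper's actual code: the flag names, the column handling, and the run_scraper helper are all hypothetical placeholders.

```python
# Sketch only (hypothetical names, not ScrapPaper's actual API):
# -s limits how many result pages are parsed, the URL may point at any
# results page (e.g. page 50 after a CAPTCHA), and the output file is
# named from the keywords plus a timestamp.
import argparse
from datetime import datetime

parser = argparse.ArgumentParser(description="ScrapPaper feature sketch")
parser.add_argument("-s", "--search-pages", type=int, default=None,
                    help="maximum number of result pages to parse")
parser.add_argument("-u", "--url", required=True,
                    help="search URL; may point at any results page, "
                         "e.g. page 50 after a CAPTCHA interruption")
parser.add_argument("-k", "--keywords", default="results",
                    help="search keywords, used to name the output file")
args = parser.parse_args()

# Request 3: keyword + timestamp file name, so runs never overwrite
# each other and each CSV is identifiable at a glance.
stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
safe_keywords = "_".join(args.keywords.split())
output_file = f"{safe_keywords}_{stamp}.csv"

# Requests 1 and 2: start from whatever page the URL points at and stop
# after --search-pages pages. run_scraper is a hypothetical placeholder.
# run_scraper(start_url=args.url, max_pages=args.search_pages,
#             output=output_file)
print(f"Would write results to {output_file}")
```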

One other suggestion, but less important:

4) To prevent losing the CSV output if a user aborts the process: create another parameter (-w = write_to_file) that makes the script write to and append the CSV file after each search page is completed. This would let the user abort the run when the current pages contain irrelevant data, without breaking the CSV output for the articles already parsed. (A sketch follows below.)
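A minimal sketch of the per-page append idea, assuming hypothetical column names and a placeholder parse_page function (ScrapPaper's real internals may differ):

```python
# Sketch: flush each page's rows to the CSV as soon as that page is
# parsed, so an aborted run (Ctrl+C, CAPTCHA) keeps everything so far.
import csv

FIELDS = ["title", "authors", "journal", "link"]  # assumed columns

def write_page_rows(path, rows, write_header):
    # Append mode means earlier pages survive a later abort.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

# Hypothetical driver loop:
# for i, page in enumerate(pages):
#     rows = parse_page(page)  # placeholder parser
#     write_page_rows(output_file, rows, write_header=(i == 0))
```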

Thanks in advance; I'm sure many users would benefit greatly from these small updates.