serebrov / get-med

Scraping medical articles (HTML/PDF) with mechanize, saving and translating them with Google Translate

How to use #1

Closed zaixi closed 6 years ago

zaixi commented 6 years ago

I also recently needed to translate a PDF and found this repository. Can you tell me how to use it?

serebrov commented 6 years ago

These scripts were written around 5 years ago and the Google Translate UI has changed since then, so they are not fully functional. I did some review and cleanup: translation by URL still works, and so does the gethtml.py script. What it does is: download the HTML, save it, pass the HTML file URL to Google Translate (for example, http://translate.google.com/translate?hl=en&sl=auto&tl=ru&u=https://math.stackexchange.com/questions/2093425/equation-of-a-plane-passing-through-intersection-of-two-planes-and-parallel-to-a), then download and save the translated page. The "core" translation feature is this function: https://github.com/serebrov/get-med/blob/4d96dca6ce211000d33e940788d6a10c57261700/browser.py#L110-L120
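To illustrate the translate-by-URL step, here is a minimal sketch using only the Python standard library. It is not the code from browser.py; the names `build_translate_url` and `fetch_translated` are hypothetical, and the query parameters (`hl`, `sl`, `tl`, `u`) are simply copied from the example URL above. Whether a plain HTTP download of that URL returns the fully translated page is an assumption, since Google may redirect or wrap the result.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the example URL in this comment.
TRANSLATE_ENDPOINT = "http://translate.google.com/translate"


def build_translate_url(page_url, target_lang="ru", source_lang="auto", ui_lang="en"):
    """Build a translate-by-URL link for a page (HTML or PDF) that is reachable online."""
    params = {"hl": ui_lang, "sl": source_lang, "tl": target_lang, "u": page_url}
    # urlencode percent-encodes the page URL so its own "?" and "&" stay inside "u".
    return TRANSLATE_ENDPOINT + "?" + urlencode(params)


def fetch_translated(page_url, out_path, target_lang="ru"):
    """Hypothetical helper: download the translated page and save it to out_path.

    Assumption: the response body is the translated document. This is not verified here.
    """
    request = Request(
        build_translate_url(page_url, target_lang=target_lang),
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urlopen(request) as response, open(out_path, "wb") as out_file:
        out_file.write(response.read())


if __name__ == "__main__":
    print(build_translate_url("https://example.com/doc.pdf"))
```

Running the file prints the link without making any network request; `fetch_translated` is only called when you want to save the result.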

For PDFs I did this: download the PDF, convert it to HTML, and translate the HTML via Google Translate (getpdf.py). This doesn't work now because I was using the form on the Google Translate page, and it now works differently than before. But translation by URL now also works for PDFs (for example http://translate.google.com/translate?hl=en&sl=auto&tl=ru&u=http://pages.mtu.edu/~fmorriso/MathType-tipstricks-full.pdf), so you can quite easily adapt the approach used for HTML pages (or even use the gethtml.py script directly).

Note: I am not 100% sure, but I think automated usage of Google Translate may violate the Google TOS. It might be OK to translate a few files for your personal use the way I did here, but you shouldn't use this approach in commercial software; use the translation API instead: https://cloud.google.com/translate/docs/

zaixi commented 6 years ago

Thanks, this already helps me. I only need to translate a few PDFs for personal use.