indic-dict / stardict-sanskrit

Stardict dictionary files for the Sanskrit language.
https://sanskrit-coders.github.io/dictionaries/offline/

Scripts for scraping dsal #35

Closed vvasuki closed 7 years ago

vvasuki commented 7 years ago

namaste @damooo

Could you share the scripts you used to scrape the Tamil Lexicon off DSAL? Hopefully we can reuse them for the other dictionaries there?

funderburkjim commented 7 years ago

Tamil Lexicon caught my eye. Thomas Malten digitized this long ago, and a simple display is http://tamildictionaries.uni-koeln.de/

If the underlying data would be of use to someone, I'm sure Thomas, though retired, would want to make it available with the usual Creative Commons type license.

vvasuki commented 7 years ago

Nice! Thanks for the info, @funderburkjim .

gasyoun commented 7 years ago

@funderburkjim yeah, let's ask Thomas. In 2006 he was against it. But times have changed.

damooo commented 7 years ago

Sorry for the late reply. Not every DSAL dictionary follows the same pattern, hence I don't have a single script. But every dictionary can be scraped with this common procedure.

  1. Go to the dictionary list page http://dsal.uchicago.edu/dictionaries/list.html , select a dictionary, and go to its home page. For instance, http://dsal.uchicago.edu/dictionaries/fabricius/ is a Tamil-English dictionary I use here as an example.

  2. On its home page, enter a search word that you know for sure exists in that dictionary as a headword (or just enter the example entry given after the search box) and search. I searched அகத்தியன் on that page, and the result page is http://dsalsrv02.uchicago.edu/cgi-bin/app/fabricius_query.py?qs=%E0%AE%85%E0%AE%95%E0%AE%A4%E0%AF%8D%E0%AE%A4%E0%AE%BF%E0%AE%AF%E0%AE%A9%E0%AF%8D

  3. On the results page there will be a result word and, beside it, a page-number hyperlink for the page it resides on. Click on that page-number hyperlink. For the dictionary above, these 'dictionary page' hyperlinks are of the form http://dsalsrv02.uchicago.edu/cgi-bin/app/fabricius_query.py?page=2 . It opens one page of the dictionary; together these pages contain all the words in the dictionary, and links to the next and previous pages are also there. So if we scrape all these pages we can get all the words, but those results are not formatted or structured, and hence many possibilities are lost.

  4. Now, the 'main headwords' on the 'dictionary page' above are themselves hyperlinked to their own web pages. If we click on any headword hyperlink we go to its web page; the web pages of individual headwords on the above 'dictionary page' are of the form http://dsalsrv02.uchicago.edu/cgi-bin/app/fabricius_query.py?qs=அகத்தியன்&searchhws=yes , and there the description is formatted and structured, unlike on the 'dictionary page'.

So the idea is: first scrape all the 'dictionary pages', which can be fetched with wget from the URL pattern, since the page numbers are known (the upper limit we can find by checking). Then collect all the 'main headword hyperlinks' from those dictionary pages with a simple one-line filtering command; those links contain the headwords themselves, so there is no particular pattern we could give wget as a regex, and collecting them this way solves that. Now scrape that collection of hyperlinks, make each headword and description into a line of a Babylon file, sort, and remove duplicate entries; after that we can do whatever we want with it. A rough Python sketch of this whole procedure is given below.
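
To make the procedure concrete, here is a minimal Python sketch of it (a sketch only, not the actual dsal_scrape.sh script mentioned later in this thread). The dictionary name, page limit, output file name, and the way the entry text is extracted are illustrative assumptions; it also assumes the requests and beautifulsoup4 packages.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

HOST = "http://dsalsrv02.uchicago.edu"
DICT = "fabricius"   # assumed example dictionary
LAST_PAGE = 5        # set to the real upper-limit page number for a full run

def headword_links(page_no):
    """Yield absolute URLs of the headword pages listed on one 'dictionary page'."""
    url = f"{HOST}/cgi-bin/app/{DICT}_query.py?page={page_no}"
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Headword links carry the word in a qs= parameter; they are
        # site-relative, so prefix the host (otherwise wget reports
        # "Scheme missing", as seen later in this thread).
        if "qs=" in href and "searchhws=yes" in href:
            yield href if href.startswith("http") else HOST + href

def scrape_entry(url):
    """Return (headword, description text) from one headword page."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    headword = unquote(url.split("qs=")[1].split("&")[0])
    # The structured markup differs per dictionary; as a rough placeholder,
    # flatten the whole page text. A real scraper would pick the entry block.
    description = " ".join(soup.get_text(separator=" ").split())
    return headword, description

def main():
    entries = {}
    for page in range(1, LAST_PAGE + 1):
        for link in headword_links(page):
            hw, desc = scrape_entry(link)
            entries.setdefault(hw, desc)   # keep the first copy, drop duplicates
    # Babylon source format: headword line, definition line, blank line.
    with open(f"{DICT}.babylon", "w", encoding="utf-8") as out:
        for hw, desc in sorted(entries.items()):
            out.write(f"{hw}\n{desc}\n\n")

if __name__ == "__main__":
    main()
```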

If possible, I will create a script for this process.

vvasuki commented 7 years ago

Thanks, @damooo! That's a very clear description.

If possible, I will create a script for this process.

I know this might be a stressful time for you, but if you can do that when feasible, it would be a great help and would let us expand into many underserved languages of interest! I for one would love the Urdu dictionary, just to understand what the northies are saying. A suggestion: as it will help you in the future, take this as an opportunity to learn Python and do the entire thing in Python.

damooo commented 7 years ago

There are two types of DSAL dictionaries: the old type and the new type. They are gradually converting all dictionaries from the old format to the new format. This script, https://github.com/sanskrit-coders/stardict-tamil/blob/master/ta-head/fabricius/dsal_scrape.sh , scrapes new-format dictionaries and creates a simple Babylon file. For now only the Tamil, Telugu, and Sanskrit dictionaries are in this format (maybe others too); the rest are still in the old format. I will add a script to scrape the old format soon. If the webpage http://dsalsrv02.uchicago.edu/cgi-bin/app/DICTIONARYNAME_query.py?page=1 exists and has content other than the template, then it is the new type; DICTIONARYNAME is the name of the dictionary in its homepage URL.
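
A minimal sketch of that check, assuming Python with the requests package; the function name and the byte threshold standing in for "has content other than the template" are assumptions, not taken from the script.

```python
import requests

def is_new_type(dictionary_name):
    """True if DICTIONARYNAME_query.py?page=1 exists and is more than the bare template."""
    url = (f"http://dsalsrv02.uchicago.edu/cgi-bin/app/"
           f"{dictionary_name}_query.py?page=1")
    try:
        resp = requests.get(url, timeout=30)
    except requests.RequestException:
        return False
    # A 404, or a body that is essentially just the template, counts as the
    # old type. The 2000-byte threshold is an arbitrary guess at "more than
    # the template".
    return resp.status_code == 200 and len(resp.text) > 2000

print(is_new_type("fabricius"))   # expected: True for a new-type dictionary
```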

damooo commented 7 years ago

If you have internet access, you can scrape and make a dictionary from http://dsal.uchicago.edu/dictionaries/kadirvelu/
The upper-limit page number for it is 1698, if the script asks.

vvasuki commented 7 years ago

They are gradually converting all dictionaries from the old format to the new format. https://github.com/sanskrit-coders/stardict-tamil/blob/master/ta-head/fabricius/dsal_scrape.sh

Thanks for adding fabricius and the new script (the script has been moved to https://github.com/sanskrit-coders/stardict-tamil/blob/master/bin/dsal_scrape.sh )!

I will add a script to scrape the old format soon.

Super! It would be good to build your script so that it can easily distinguish between the two.

vvasuki commented 7 years ago

If you have internet access, you can scrape and make a dictionary from http://dsal.uchicago.edu/dictionaries/kadirvelu/ . The upper-limit page number for it is 1698, if the script asks.

I get errors like:

/cgi-bin/app/kadirvelu_query.py?qs=ஊசித்துளை&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=ஊடலைத்தீர்த்தல்&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=ஊரிலுள்ளார் உண்ணும்நீர்&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=ஊர்க்குருவி&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=ஊர்சூழ்சோலை&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=ஊழிக்காலம்&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=ஊறுபுனல்&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=ஊற்றுநீர்&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=ஊன்விற்போர்&searchhws=yes: Scheme missing.
/cgi-bin/app/kadirvelu_query.py?qs=எடுத்தலோசைவகை&searchhws=yes: Scheme missing.
parallel: Error: Command line too long (69768 >= 65524) at input 0: 35) <a href="/cgi-bin/app/kadirvelu_query.py?qs=��...
mv: cannot stat '*': No such file or directory
sed: can't read dPada_1.htm: No such file or directory
sed: can't read *.htm: No such file or directory
/home/vvasuki/stardict-tamil/ta-head/kadirvelu

gasyoun commented 7 years ago

They are gradually converting all dictionaries from the old format to the new format.

Have not seen any changes recently.

And I've scraped all visible and invisible dictionaries myself a few times. The worst part - some pages are missing in the original as well.

damooo commented 7 years ago

I get errors like:

@vvasuki I updated and tested the script; it is working now. @gasyoun They changed the Telugu and Tamil dictionaries from the old format to the new one three months ago, hence I thought they might be changing all of them.

vvasuki commented 7 years ago

Now I get stuff like:

http://dsalsrv02.uchicago.edu/cgi-bin/app/_query.py?page=632:
2017-04-25 07:10:21 ERROR 404: Not Found.
http://dsalsrv02.uchicago.edu/cgi-bin/app/_query.py?page=633:
2017-04-25 07:10:21 ERROR 404: Not Found.
http://dsalsrv02.uchicago.edu/cgi-bin/app/_query.py?page=634:
2017-04-25 07:10:21 ERROR 404: Not Found.
http://dsalsrv02.uchicago.edu/cgi-bin/app/_query.py?page=635:
2017-04-25 07:10:21 ERROR 404: Not Found.
http://dsalsrv02.uchicago.edu/cgi-bin/app/_query.py?page=636:
2017-04-25 07:10:21 ERROR 404: Not Found.
http://dsalsrv02.uchicago.edu/cgi-bin/app/_query.py?page=637:
2017-04-25 07:10:21 ERROR 404: Not Found.
http://dsalsrv02.uchicago.edu/cgi-bin/app/_query.py?page=638:
2017-04-25 07:10:21 ERROR 404: Not Found.
http://dsalsrv02.uchicago.edu/cgi-bin/app/_query.py?page=639:
2017-04-25 07:10:21 ERROR 404: Not Found.

I think there must be something wrong in the URLs above: there is nothing in them to identify the dictionary.
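
If the guess is right that the dictionary-name part of the DICTIONARYNAME_query.py URL came out empty here, a small guard in a Python version of the scraper would catch it before issuing hundreds of 404 requests; DICT below is a hypothetical variable name, not one from the actual script.

```python
DICT = ""   # should be e.g. "kadirvelu"; empty reproduces the broken URLs above
if not DICT:
    # Fail fast instead of requesting .../_query.py?page=N hundreds of times.
    raise SystemExit("dictionary name is empty: URLs would collapse to .../_query.py?page=N")
page_url = f"http://dsalsrv02.uchicago.edu/cgi-bin/app/{DICT}_query.py?page=632"
```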

gasyoun commented 7 years ago

they changed the Telugu and Tamil dictionaries from the old format to the new one three months ago

Oh, OK. We can ask; I was exchanging emails with them in 2007.

vvasuki commented 7 years ago

Will eventually take this up in https://github.com/sanskrit-coders/dsal-scraper .