
unhammer / apertium-en-hi

These are the linguistic data for the Apertium English-Hindi machine translator.


Creating hindi corpus #6

Open NikantVohra opened 11 years ago

NikantVohra commented 11 years ago

@darthxaher I am trying to create a Hindi corpus by crawling Wikipedia in order to get a better idea of the coverage of the dictionaries. Do you have a script for this?

azmfaridee commented 11 years ago

@NikantVohra I'll have to check whether I still have one. I'll let you know. Meanwhile, have you asked @ftyers whether he still has one?

I'd be more than happy to write one for you if neither option works out. :)

NikantVohra commented 11 years ago

ftyers has provided me with two scripts. I will try them and get back to you if I need any help. :)

NikantVohra commented 11 years ago

Hey, I am not able to extract the wiki data using these scripts:

http://pastebin.com/ugUYNfC2 http://pastebin.com/LwhJwCnu

Can you help with that?

ftyers commented 11 years ago

On Thu, 4 Jul 2013 at 01:09 -0700, NikantVohra wrote:

> Hey I am not able to extract wiki data using these scripts: http://pastebin.com/ugUYNfC2 http://pastebin.com/LwhJwCnu
>
> can you help with that?

Can you at least say what you tried?

azmfaridee commented 11 years ago

Are you trying to write or use a web crawler to download pages from Wikipedia, or are you trying to extract words from pages you have already downloaded?

The two scripts that Fran gave you will help you extract words from already downloaded pages, but if you need a script to download the pages in the first place, they are not what you are looking for.

You can download the pages with curl or with a web crawling framework such as Scrapy (http://doc.scrapy.org/en/latest/intro/tutorial.html). Even writing a very rudimentary crawler is quite easy with basic Python.

If that is the case, I'll try to write one for you tomorrow :) I've got my hands full with a lot of things, so please be a little patient.
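To give an idea of what a "very rudimentary crawler" could look like, here is a minimal sketch using only the Python standard library. It is not one of the scripts mentioned in this thread; the names `LinkParser`, `fetch` and `crawl`, the User-Agent string and the `max_pages` limit are all made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download one page and return it as text (assumes UTF-8)."""
    # The User-Agent value is a placeholder, not a registered bot name.
    request = Request(url, headers={"User-Agent": "corpus-crawler-sketch/0.1"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl that stays on the host of start_url.

    Returns a dict mapping each visited URL to its raw HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

A real crawl would also need to respect robots.txt and wait between requests, which this sketch does not do. The `fetch_page` parameter exists only so the crawler can be tried out on canned pages without touching the network.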

NikantVohra commented 11 years ago

Thanks @darthxaher, but it is fine now. Fran gave me the link to the wiki dumps for the Hindi wiki pages, so I do not need to implement the web crawler. I can just extract the data from the dump and use the scripts to build the corpus. I will report back to you once I have the coverage for the dictionaries.
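For anyone following along, pulling the page text out of a MediaWiki XML dump can be sketched roughly as below. This is not one of the pastebin scripts; it assumes the usual export layout (`<page>` elements containing a `<title>` and a `<text>`), and the helper names `local_name`, `open_dump` and `iter_pages` are made up. The output is still raw wikitext, so markup would need to be stripped in a further step.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def open_dump(path):
    """Open a dump file in binary mode, decompressing .bz2 files on the fly."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    return open(path, "rb")


def iter_pages(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki XML export.

    `source` is a filename or a binary file object.
    """
    title, text = None, None
    # iterparse streams the file, so the whole dump is never held in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text or ""
            title, text = None, None
            elem.clear()  # drop the finished page's subtree to save memory
```

Typical use would be something like `for title, text in iter_pages(open_dump(path)): ...`, where `path` points at the downloaded dump file.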

NikantVohra commented 11 years ago

Hey, here are my results for the corpus obtained from the wiki:

http://wiki.apertium.org/wiki/Hindi_and_English/Results

The results for the morphological analyser are similar to those for the previous corpus, but the bilingual dictionary shows a drop in translation accuracy of about 4%.
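As a rough illustration of how a coverage figure like this can be computed, here is a small sketch. It assumes the analyser output is in the Apertium stream format, where each token looks like `^surface/analysis$` and an unknown word is marked with a leading `*` in its analysis (`^word/*word$`). It is not necessarily how the numbers on the results page were produced, it ignores escaped characters, and the function name `coverage` and the tags in the sample are invented.

```python
import re

# One lexical unit: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^/^$]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (fraction of known tokens, list of unknown surface forms)."""
    total = 0
    known = 0
    unknown = []
    for surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        if analyses.startswith("*"):
            # The analyser had no entry for this word.
            unknown.append(surface)
        else:
            known += 1
    ratio = known / total if total else 0.0
    return ratio, unknown


# Hand-written sample; the tags are invented for the example.
sample = "^घर/घर<n><m><sg>$ ^foo/*foo$ ^है/हो<vbz>$ ^bar/*bar$"
ratio, missing = coverage(sample)
print(f"coverage: {ratio:.1%}, unknown: {missing}")
```

Sorting the unknown words by frequency is usually the next step, since it shows which dictionary entries would improve coverage the most.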

saggy123 commented 10 years ago

Hi Nikant, can you please share the Hindi corpus? We need it very urgently.

ayushmi commented 9 years ago

Hi! @NikantVohra, can you please share the Hindi corpus?