sehsanm / embedding-benchmark

Word Embedding benchmark project By Shahid Beheshti University NLP Lab
GNU General Public License v3.0

Find and upload Persian News Corpus #1

Open sehsanm opened 5 years ago

abb4s commented 5 years ago

I recently found Sketch Engine, a tool for building a corpus from the web. This tutorial explains how to use it: https://www.sketchengine.eu/quick-start-guide/create-your-corpus-lesson-4/ . I have made a sample corpus from Hamshahri news with this tool, which I am attaching here: ham2_2(1).txt

We can also just look for existing corpora such as "Hamshahri" or "irBlog".

sehsanm commented 5 years ago

Nice tool. I know that someone in our NLP lab has already collected a Persian news corpus. Also note that we should be aiming for more than 10 million sentences. If this tool is able to crawl all of that, let's build a fresh copy.

abb4s commented 5 years ago

It has a limit of 1,000,000 words.

sehsanm commented 5 years ago

If you are starting to work on this, please move it to "In progress".

sehsanm commented 5 years ago

Any progress on this, @PoriNiki?

FullDataAlchemist commented 5 years ago

Yes. I am trying to get the corpus from "Kanal e Khabar".

FullDataAlchemist commented 5 years ago

I have recently been talking with them, but this could take time.

sehsanm commented 5 years ago

Hi. We are only one Readme.md away from fully completing this task.

FullDataAlchemist commented 5 years ago

Hi. On my side, the data in question has been cancelled for now, so I am stepping away from this issue.