Open sehsanm opened 5 years ago
Recently I found Sketch Engine, a tool for building a corpus from the web. This tutorial explains how to use it: https://www.sketchengine.eu/quick-start-guide/create-your-corpus-lesson-4/ . I made a sample corpus from Hamshahri news with this tool, which I am attaching here. ham2_2(1).txt
We can also just find existing corpora such as "Hamshahri" or "irBlog".
Nice tool. I know that someone in our NLP lab has already collected a Persian news corpus. Also note that we should be aiming for more than 10 million sentences. If this tool is able to crawl all of that, let's build a fresh copy.
It has a 1,000,000-word limit.
If you are starting to work on this, please move it to "In progress".
Any progress on this, @PoriNiki?
Yes. I am trying to get the corpus from "Kanal e Khabar".
I have recently been talking with them, but this could take time.
Hi. We are only one Readme.md away from fully completing this task.
Hi. On my side, the data mentioned above has been cancelled for now, so I am stepping away from this item.