google / corpuscrawler

Crawler for linguistic corpora

Improve readme documentation on how to provide a new crawler #80

Open hugolpz opened 3 years ago

hugolpz commented 3 years ago

This /CONTRIBUTING.md is a License Agreement / Code of Conduct to sign. As far as I can see, this valuable project has no actual tutorial.

I don't have the Python and coding knowledge to fix this documentation issue, but I can map out the road so it becomes easier for the next person to do so.

Wanted

If a user wants to add a language such as Catalan from Barcelona (ca, cat: missing), what do they need to jump in quickly? What should they provide?

API (to complete)

Functions defined in util.py, in order of appearance as of 2021/02/26. If you have relevant knowledge, please help with a sub-section or a single item.

Some tools

Main element

Some crawlers for multi-languages sites

Some cleaners

Shorter way to do so

In-code comments can do a lot, and so can pointers to wisely chosen sections. If you have the required know-how, please add comments to an existing crawler of your choice and point to it as an in-code tutorial.

@sffc, @brawer: could either of you help with that?

Aayush-hub commented 3 years ago

@hugolpz Can I help by adding steps to the readme on how to add a new crawler, starting with the basics of installing Python?

hugolpz commented 3 years ago

Hello Aayush, thank you for jumping in. I think we can assume the ability to install Python. Readme.md should just have a "Requirements" section with the Python version and the associated pip dependencies:

### Requirements
* python x.x+

### Dependencies
```
pip3 install {package1}
pip3 install {package2}
pip3 install {package3}
```

This would help, yes.
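A Requirements section like the one above can also be checked from a short script. The following is only a sketch: the minimum version `(3, 0)` and the module names are placeholders chosen for illustration, not the project's actual requirements, and the helper names `meets_minimum` and `missing_modules` are hypothetical.

```python
import importlib.util
import sys


def meets_minimum(version_info, minimum):
    """Return True if the (major, minor) part of version_info is >= minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def missing_modules(names):
    """Return the top-level module names that Python cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Placeholder values: replace with the real minimum version and the
    # real dependency list once they are documented in the README.
    minimum_version = (3, 0)
    required = ["json", "placeholder_dependency_name"]

    if not meets_minimum(sys.version_info, minimum_version):
        print("Python %d.%d+ is required" % minimum_version)
    for name in missing_modules(required):
        print("Missing module:", name)
```

Note that the name used with `pip3 install` can differ from the name used in `import`, so the list passed to `missing_modules` should contain import names.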

I did a broad review of this project, but I'm a JS dev so I'm walking quite blind here. Still, I don't think this project is that hard to contribute to: the main obstacles are 1. how to start and 2. what kind of output each crawler must provide, how, and where.

@brawer, would you temporarily grant me maintainer status so I can handle incoming PRs? I would be happy to give that user right back as soon as a new, active Python dev emerges.

Aayush-hub commented 3 years ago

@hugolpz Sure, I'm looking to add the required dependency information to the README :)

Aayush-hub commented 3 years ago

@hugolpz I'm getting a "no module found: corpuscrawler" error when running main.py. Can you please help debug it?

hugolpz commented 3 years ago

JS dev here. I try to help around, but I don't know Python. I can look for Python help, but it will take at least 5 days.
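For anyone picking up the "no module found" report above: in general, Python raises this kind of error when the directory containing a package is not on `sys.path`. Whether that is the cause here is not confirmed in this thread. The sketch below reproduces the general mechanism with a throwaway package; the names `demo_pkg_for_issue` and `src` are made up for the demonstration and do not reflect this repository's layout.

```python
import importlib
import os
import sys
import tempfile


def can_import(name):
    """Return True if `name` imports cleanly, False on ModuleNotFoundError."""
    importlib.invalidate_caches()  # make sure newly created files are seen
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


def demo():
    """Show that a package is importable only once its parent dir is on sys.path."""
    with tempfile.TemporaryDirectory() as root:
        parent = os.path.join(root, "src")  # hypothetical source directory
        pkg_dir = os.path.join(parent, "demo_pkg_for_issue")
        os.makedirs(pkg_dir)
        with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
            f.write("VALUE = 1\n")

        before = can_import("demo_pkg_for_issue")  # parent not on sys.path yet
        sys.path.insert(0, parent)
        try:
            after = can_import("demo_pkg_for_issue")  # parent now on sys.path
        finally:
            # Undo the changes so the demo leaves the interpreter state clean.
            sys.path.remove(parent)
            sys.modules.pop("demo_pkg_for_issue", None)
        return before, after


if __name__ == "__main__":
    print(demo())
```

If the same mechanism applies to main.py, the usual remedies are installing the package into the active environment or adding its parent directory to `PYTHONPATH`. Which one fits this project depends on its packaging, which this thread does not document.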