google / corpuscrawler

Crawler for linguistic corpora

Shorten project structure #82

Open hugolpz opened 3 years ago

hugolpz commented 3 years ago

Related to #80. A suggestion: mainly, move the core code up so that it is more visible. The crawlers are kept in their own folder.

Would such changes disrupt any complementary toolchain?

hugolpz commented 7 months ago

Hello @sffc. I noticed you made a Python change (https://github.com/google/corpuscrawler/commit/10adaecf4ed5a7d0557c8e692c186023746eb001) and are active on this project, so allow me to cc you on this minor issue.

sffc commented 7 months ago

The project is currently structured as a pip package, and it should stay one. That said, I would support reorganizing the utilities and crawlers into separate directories, more along the lines of:

corpuscrawler
├─ README.md
├─ LICENSE
├─ LICENSE.md
├─ CONTRIBUTING.md
├─ corpuscrawler
└─ Lib
   └─ corpuscrawler
      ├─ util
      │   └─ *.py: utilities
      └─ crawlers
          └─ crawl_{iso}.py: crawlers
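
To illustrate how tooling might locate crawlers under such a layout, here is a minimal sketch. It assumes the `crawlers/crawl_{iso}.py` naming from the tree above; `find_crawlers` and the dotted prefix `corpuscrawler.crawlers` are hypothetical names, not part of the existing code base.

```python
import re
from pathlib import Path

# Assumed naming rule, taken from the proposed tree: crawlers/crawl_{iso}.py
CRAWLER_FILE = re.compile(r"^crawl_(?P<iso>[A-Za-z0-9_]+)\.py$")


def find_crawlers(package_dir):
    """Map each language code to a dotted module name (hypothetical helper)."""
    crawlers_dir = Path(package_dir) / "crawlers"
    found = {}
    for path in sorted(crawlers_dir.glob("crawl_*.py")):
        match = CRAWLER_FILE.match(path.name)
        if match:
            # The package prefix is an assumption based on the proposed layout.
            found[match.group("iso")] = f"corpuscrawler.crawlers.{path.stem}"
    return found
```

Each resulting module name could then be passed to `importlib.import_module`, so a command-line entry point would not need to change when crawler files are added or removed.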

hugolpz commented 7 months ago

This would add clarity, yes. The project currently lacks clear onboarding manuals and pointers. A clean structure separating the few utilities from the 1000+ crawler files would improve both clarity and onboarding.