scrapinghub / arche

Analyze scraped data
https://arche.readthedocs.io/
MIT License

Find the common root for URL fields #16

Open manycoding opened 5 years ago

manycoding commented 5 years ago

Say we scrape one website across different categories. In that case the URLs of all items share the same root. For example, https://pandas.pydata.org/pandas-docs/stable/categorical.html and https://pandas.pydata.org/pandas-docs/stable/merging.html have https://pandas.pydata.org/pandas-docs/stable/ in common.

By returning this common root, we can analyze URL fields without a JSON schema.
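
A rough sketch of how this could work, using only the standard library. The function name `common_url_root` is hypothetical and not part of arche. It assumes a "root" means a shared scheme and host plus the longest run of shared directory segments, compared segment by segment rather than character by character.

```python
from urllib.parse import urlsplit, urlunsplit


def common_url_root(urls):
    """Return the shared scheme://host/dir/ prefix of `urls`, or None.

    Hypothetical helper for illustration only, not an arche API.
    """
    split = [urlsplit(u) for u in urls]
    if not split:
        return None

    scheme, netloc = split[0].scheme, split[0].netloc
    if not scheme or not netloc:
        return None
    # URLs on different schemes or hosts have no common root.
    if any((s.scheme, s.netloc) != (scheme, netloc) for s in split):
        return None

    # Directory segments of each path; the last segment (the page) is dropped.
    dirs = [s.path.split("/")[1:-1] for s in split]

    common = []
    for segments in zip(*dirs):
        if len(set(segments)) != 1:
            break
        common.append(segments[0])

    path = "/" + "/".join(common)
    if common:
        path += "/"
    return urlunsplit((scheme, netloc, path, "", ""))


urls = [
    "https://pandas.pydata.org/pandas-docs/stable/categorical.html",
    "https://pandas.pydata.org/pandas-docs/stable/merging.html",
]
print(common_url_root(urls))
# https://pandas.pydata.org/pandas-docs/stable/
```

Comparing whole path segments avoids cutting a root in the middle of a name, which a plain character-wise prefix would do for paths such as `/docs-a/` and `/docs-b/`.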

manycoding commented 5 years ago

pandas-profiling does something similar: https://github.com/pandas-profiling/pandas-profiling/blob/master/pandas_profiling/model/base.py#L122