m-rey / docscraper

scraper for doc search results. Dump all the things!

Add more data sources #2

Open m-rey opened 3 years ago

m-rey commented 3 years ago
  1. Check whether the following websites are suitable for scraping.
  2. Also make sure that each website actually queries a different database, to avoid duplicates. There might be multiple frontends (i.e. websites) for the same database. A rough check is to run the same search query on each website and compare the search results.
  3. Create a spider for each site that meets the requirements.
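
The comparison in step 2 could be sketched roughly like this. It is only an illustration, assuming each site's results for the same query have already been collected as `(name, city)` tuples; `normalize` and `result_overlap` are hypothetical helper names, not part of the existing code.

```python
def normalize(entry):
    """Reduce a (name, city) tuple to a lowercase, whitespace-collapsed key."""
    name, city = entry
    return (" ".join(name.lower().split()), " ".join(city.lower().split()))


def result_overlap(results_a, results_b):
    """Jaccard overlap of two result lists for the same query (0.0 to 1.0)."""
    a = {normalize(e) for e in results_a}
    b = {normalize(e) for e in results_b}
    if not a and not b:
        # Two empty result lists say nothing about a shared database.
        return 0.0
    return len(a & b) / len(a | b)
```

A score close to 1.0 across several different queries would hint that two sites are frontends for the same database. A low score is weaker evidence, since sites can rank or truncate results differently.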

List of websites to consider:

hagenest commented 3 years ago

I don't know how we can reliably identify duplicates between different data sources. It would probably be best to just choose something simple that every site should have, like Name and City, even if we'd lose a few entries.
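
A minimal sketch of that Name and City approach, assuming every scraped entry is a dict with `name` and `city` fields (the field names and the helpers `dedup_key` and `deduplicate` are made up for illustration):

```python
import unicodedata


def dedup_key(name, city):
    """Build a comparison key that ignores case, accents and extra spaces."""
    def clean(text):
        # Decompose accented characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        return " ".join(text.casefold().split())
    return (clean(name), clean(city))


def deduplicate(entries):
    """Keep the first entry seen for each (name, city) key."""
    seen = set()
    unique = []
    for entry in entries:
        key = dedup_key(entry["name"], entry["city"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```

This would wrongly merge two different people with the same name in the same city, which is the "lose a few entries" trade-off mentioned above.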

A hilariously over-engineered solution would be to use NLP to filter the duplicates though.

Also, why don't we just use existing data from OpenStreetMap?