Open mchiquier opened 4 years ago
This is extremely difficult, and arguably its own project. I consider this a "nice to have", but it should be the last thing we do.
Scraping data also leads into problematic territory, like terms-of-service violations and sites just straight up blocking you for taking their data. And not all sites have nice APIs for scraping, so a general-purpose scraper would be hard.
Ideally this is modular, but it would be good to have boilerplate for scraping data from:
- Twitter
- YouTube
- TikTok
- Instagram
Or even just Google Drive, and pipelining it into a PyTorch DataLoader.
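
To make the "modular" part concrete, here is a rough sketch of what the boilerplate could look like, not a real implementation. Everything in it is hypothetical: the `Source` interface, the `register` decorator, the `SOURCES` registry, and the `LocalFolder` stand-in (which just reads a folder, e.g. a locally synced Google Drive directory) are names made up for illustration. No actual site scraping is done here; each site would get its own `Source` subclass.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Iterator, List, Type


class Source(ABC):
    """Hypothetical interface: one subclass per site (twitter, youtube, ...)."""

    @abstractmethod
    def fetch(self, query: str) -> Iterator[Path]:
        """Yield local paths of the items retrieved for `query`."""


# Hypothetical registry mapping a source name to its class.
SOURCES: Dict[str, Type[Source]] = {}


def register(name: str):
    """Class decorator that adds a Source subclass to the registry."""
    def wrap(cls: Type[Source]) -> Type[Source]:
        SOURCES[name] = cls
        return cls
    return wrap


@register("local")
class LocalFolder(Source):
    """Stand-in source: lists files in a folder (assumed already downloaded)."""

    def fetch(self, query: str) -> Iterator[Path]:
        # Here `query` is simply a directory path; sorted for a stable order.
        yield from sorted(p for p in Path(query).iterdir() if p.is_file())


class ScrapedDataset:
    """Map-style dataset: implements __len__ and __getitem__."""

    def __init__(self, source_name: str, query: str):
        source = SOURCES[source_name]()
        self.paths: List[Path] = list(source.fetch(query))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> bytes:
        # Returns raw bytes; decoding (image, video, text) is left to the caller.
        return self.paths[idx].read_bytes()
```

Since `ScrapedDataset` follows the map-style protocol, it should be possible to hand an instance to `torch.utils.data.DataLoader`, most likely with a custom `collate_fn` that decodes the raw bytes into tensors. Adding a new site would then only mean writing one more registered `Source` subclass, which keeps the terms-of-service and rate-limit handling isolated per site.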