roniemartinez / dude

dude uncomplicated data extraction: A simple framework for writing web scrapers using Python decorators
https://roniemartinez.github.io/dude/
GNU Affero General Public License v3.0

Study: ML-based scrapers #70

Open roniemartinez opened 2 years ago

roniemartinez commented 2 years ago

Possible format:

@select(sample="path/to/training/data")
def handler(result):
    return {"data": result}
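To make the proposed format concrete, here is a minimal sketch of how a `sample`-based decorator could register handlers. Everything in it is an assumption for illustration: the `HANDLERS` registry, this standalone `select`, and the `run` helper are hypothetical and are not dude's actual implementation, and the ML backend that would produce the matches is left out entirely.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping a training-sample path to its handlers.
# This is NOT dude's real implementation, only an illustration.
HANDLERS: Dict[str, List[Callable[[Any], dict]]] = {}


def select(sample: str) -> Callable:
    """Register a handler for items found by a model trained on `sample`."""

    def decorator(func: Callable[[Any], dict]) -> Callable[[Any], dict]:
        HANDLERS.setdefault(sample, []).append(func)
        return func

    return decorator


@select(sample="path/to/training/data")
def handler(result):
    return {"data": result}


def run(sample: str, matches: List[Any]) -> List[dict]:
    """Apply every handler registered for `sample` to each matched item.

    `matches` stands in for whatever an ML backend would return.
    """
    return [func(item) for func in HANDLERS.get(sample, []) for item in matches]
```

In this sketch, a backend would be responsible for turning the training data into `matches`; the decorator only records which handler goes with which sample.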

Potential backends:

daniel7an commented 2 years ago

Hey, it's a good idea to use mlscraper as a backend. But first of all, we need data (inputs and outputs).

roniemartinez commented 2 years ago

@daniel7an

Yes, I can see potential on this one.

daniel7an commented 2 years ago

@roniemartinez

Autoscraper is another one that would be great to have in Dude. It learns the scraping rules and returns similar elements. It needs only a few examples and isn't as complicated as mlscraper.

Input: wanted_list = ["What are metaclasses in Python?"]

Output: [ 'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 'How to call an external command?', 'What are metaclasses in Python?', 'Does Python have a ternary conditional operator?', 'How do you remove duplicates from a list whilst preserving order?', 'Convert bytes to a string', 'How to get line count of a large file cheaply in Python?', "Does Python have a string 'contains' substring method?", 'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?' ]

Any ideas to add this one to Dude? Should I open a new issue for this?

roniemartinez commented 2 years ago

@daniel7an

The thing is, I've been reading the source code of Autoscraper and it is not actually using Machine Learning or AI. It is just using difflib.SequenceMatcher. The project's claims that it runs on ML or AI are incorrect.

https://github.com/alirezamika/autoscraper/blob/973ba6abed840d16907a556bc0192e2bf4806c6d/autoscraper/utils.py#L42-L66


Please correct me if I am wrong. I cannot categorize it as ML, but it certainly learns by saving rules.
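For readers unfamiliar with difflib.SequenceMatcher, the sketch below shows the kind of similarity scoring it provides. It is only an illustration of the standard-library call, not a copy of Autoscraper's code; the helper names `text_similarity` and `fuzzy_match` and the 0.8 threshold are assumptions chosen for this example.

```python
from difflib import SequenceMatcher
from typing import List


def text_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(target: str, candidates: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the candidates whose similarity to `target` meets the threshold.

    The 0.8 default is an arbitrary value picked for this sketch.
    """
    return [c for c in candidates if text_similarity(target, c) >= threshold]


if __name__ == "__main__":
    wanted = "What are metaclasses in Python?"
    titles = [
        "What are metaclasses in Python?",
        "Convert bytes to a string",
    ]
    print(fuzzy_match(wanted, titles))
```

The score is a deterministic string comparison, with no training step or model involved, which is consistent with the point above that this is rule matching and not machine learning.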

roniemartinez commented 2 years ago

@daniel7an

Any ideas to add this one to Dude? Should I open a new issue for this?

Though it seems Autoscraper does not fall into this category, I believe it is a very powerful tool for web scraping and I'd love to include it. Please open a separate ticket.

daniel7an commented 2 years ago

Done ✅