raznem / parsera

Lightweight library for scraping web-sites with LLMs
https://parsera.org
GNU General Public License v2.0
732 stars, 47 forks

Changing Extraction Prompt #2

Closed portoaj closed 4 weeks ago

portoaj commented 1 month ago

The current API of this package is extremely simple, which is great, but it doesn't allow customizing the library's key feature: how the package extracts information.

Right now the only option is the Tabular Extractor.

I think at minimum three options are needed:

The tabular extractor, which has output of the form: [{"link": "https://example.com/link1"}, {"link": "https://example.com/link2"}]

The list extractor, which has the format: {"link": ["https://example.com/link1", "https://example.com/link2"]}

And finally a single-item extractor, which has the format: {"link": "https://example.com/link1", "title": "Example title"}. (A better use for this one would be to get a single value, e.g. the number of likes for a single video.)
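
To make the difference between the three shapes concrete, here is a rough sketch in Python of how the list and single-item outputs could be derived from the tabular one. The helper names `to_list_format` and `to_single_item` are hypothetical and not part of parsera's API. The sketch only assumes the extractor returns the tabular shape described above.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def to_list_format(rows: List[Row]) -> Dict[str, List[Any]]:
    """Tabular -> list: gather each field's values into one list per key."""
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def to_single_item(rows: List[Row]) -> Row:
    """Tabular -> single item: keep only the first row (empty dict if none)."""
    return dict(rows[0]) if rows else {}


# Hypothetical tabular output, matching the example in this issue.
tabular = [
    {"link": "https://example.com/link1"},
    {"link": "https://example.com/link2"},
]

as_list = to_list_format(tabular)
single = to_single_item(tabular)
```

Post-processing like this only reshapes what the model already returned. A dedicated prompt per extractor would probably still be better, since the model could then be told up front whether one value or many is expected.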

Thankfully the steps toward implementing these changes seem pretty reasonable given the code quality of the backend.