html-extract / hext

Domain-specific language for extracting structured data from HTML documents
https://hext.thomastrapp.com
Apache License 2.0

TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract' #27

Closed (impredicative closed this issue 1 year ago)

impredicative commented 1 year ago

With Python 3.12, hext.Rule('').extract('') gives the error:

  File "python3.12/site-packages/hext/__init__.py", line 139, in extract
    return _hext.Rule_extract(self, html, max_searches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'.
  Possible C/C++ prototypes are:
    Rule::extract(Html const &,std::uint64_t) const
    Rule::extract(Html const &) const

I am of course also getting this error with a more real-life example. At this time I cannot use hext for anything new.

thomastrapp commented 1 year ago

Rule.extract does not accept a string, only hext.Html.

import hext
rule = hext.Rule("<a href:link/>")
# (1) Ok, the argument for extract is of type hext.Html
results = rule.extract(hext.Html("""<a href="b"></a>"""))
# (2) Error, the argument for extract is of type string:
results = rule.extract("""<a href="b"></a>""")

If this was possible in a previous version of Hext (≥1.0.0), please let me know, as this would be a breaking change in the API.

The error message is unfortunately very unhelpful, and I will fix that in a future release with html-extract/hext#28.
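Until the error message improves, a small wrapper on the caller's side can guard against this mistake. The following is only a sketch, not part of hext: `extract_from` and its `html_type` parameter are hypothetical names introduced here for illustration. It assumes only what is shown above, namely that `hext.Html` can be constructed from a string and that `Rule.extract` accepts a `hext.Html`.

```python
def extract_from(rule, html, html_type=None):
    """Call rule.extract(), accepting either a str or an Html object.

    Hypothetical helper, not part of hext. `html_type` defaults to
    hext.Html and exists only so the helper can be tried with stand-ins.
    """
    if html_type is None:
        # Imported lazily so that defining the helper does not require hext.
        import hext
        html_type = hext.Html

    if isinstance(html, str):
        # Rule.extract only accepts hext.Html, so wrap plain strings first.
        html = html_type(html)
    elif not isinstance(html, html_type):
        # Fail early with a message that names the offending type.
        raise TypeError(
            f"expected str or {html_type.__name__}, got {type(html).__name__}"
        )
    return rule.extract(html)
```

With hext installed, calling extract_from(rule, """<a href="b"></a>""") should then behave like case (1) above, while any other argument type raises a TypeError that names the type that was passed.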

Thank you for creating this issue.

brandonrobertz commented 1 year ago

> If this was possible in a previous version of Hext (≥1.0.0), please let me know, as this would be a breaking change in the API.

This was not possible in 0.8 (just re-tested to be sure). AFAIK you always needed to pass an Html object.

impredicative commented 1 year ago

Yes, it had been a while since I used hext, and I misremembered. Indeed hext.Rule('').extract(hext.Html('')) is what works.

As an aside, I think there really needs to exist at least one comprehensive page (or tabs) per supported programming language in the documentation. It would contain various necessary examples to train the user to use hext effectively.

impredicative commented 1 year ago

As an example, please see the organization and tabs here (one tab per supported language).

thomastrapp commented 1 year ago

> As an aside, I think there really needs to exist at least one comprehensive page (or tabs) per supported programming language in the documentation. It would contain various necessary examples to train the user to use hext effectively.

I agree and have added another issue for this: html-extract/html-extract.github.io#4.