scrapinghub / extruct

Extract embedded metadata from HTML markup
BSD 3-Clause "New" or "Revised" License

Problem parsing rdfa in aws lambda #102

Open kamilliano opened 5 years ago

kamilliano commented 5 years ago

Hi, I wanted to ask whether anyone out there has used extruct on AWS Lambda. I tested an extruct function there, and it fails for RDFa. The other default metadata types work fine.

A simple test case:

import pprint as pp
import requests
from extruct.rdfa import RDFaExtractor
import config_files.logging_config as log

logger = log.logger

def main():

    try:
        import extruct
        logger.info("Testing importing extruct which loaded successfully")
        import rdflib
        logger.info("Testing importing rdflib which loaded successfully")
        import extruct.rdfa
        logger.info("Testing importing rdfa which loaded successfully")
        from extruct.rdfa import RDFaExtractor
        logger.info("Testing importing RDFaExtractor which loaded successfully")

    except ImportError as e:
        logger.error("failed to import : {}".format(e))

    try:
        url = 'https://www.littlewoods.com/ri-plus-floral-trumpet-sleeve-top/1600159211.prd'
        r = requests.get(url)
        rdfae = RDFaExtractor()
        rdfa_json = rdfae.extract(r.text, base_url=None)

        pp.pprint(rdfa_json)

    except Exception as e:
        logger.exception("Failed to extract rdfa. Error: {}".format(e))

main()

The part of pipenv graph for extruct when I build the artifact.zip file:

extruct==0.7.1
  - lxml [required: Any, installed: 3.6.0]
  - mf2py [required: Any, installed: 1.1.2]
    - BeautifulSoup4 [required: >=4.6.0, installed: 4.7.1]
      - soupsieve [required: >=1.2, installed: 1.6.2]
    - html5lib [required: >=1.0.1, installed: 1.0.1]
      - six [required: >=1.9, installed: 1.11.0]
      - webencodings [required: Any, installed: 0.5.1]
    - requests [required: >=2.18.4, installed: 2.18.4]
      - certifi [required: >=2017.4.17, installed: 2018.11.29]
      - chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
      - idna [required: >=2.5,<2.7, installed: 2.6]
      - urllib3 [required: >=1.21.1,<1.23, installed: 1.22]
  - rdflib [required: Any, installed: 4.2.2]
    - isodate [required: Any, installed: 0.6.0]
      - six [required: Any, installed: 1.11.0]
    - pyparsing [required: Any, installed: 2.3.0]
  - rdflib-jsonld [required: Any, installed: 0.4.0]
    - rdflib [required: >=4.2, installed: 4.2.2]
      - isodate [required: Any, installed: 0.6.0]
        - six [required: Any, installed: 1.11.0]
      - pyparsing [required: Any, installed: 2.3.0]
  - six [required: Any, installed: 1.11.0]
  - w3lib [required: Any, installed: 1.19.0]
    - six [required: >=1.4.1, installed: 1.11.0]

When I run this locally in the same pipenv environment (Ubuntu 17.10, Docker 17.12.0-ce, pipenv==v2018.11.26), I don't experience any issues. On Lambda invocation, I log the following stack trace:

2019-01-10 14:32:49,092:INFO:pid 1:Testing importing extruct which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdflib which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing rdfa which loaded successfully
2019-01-10 14:32:49,092:INFO:pid 1:Testing importing RDFaExtractor which loaded successfully
2019-01-10 14:32:51,753:ERROR:pid 1:Failed to extract rdfa. Error: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)
Traceback (most recent call last):
  File "/var/task/rdflib/plugin.py", line 100, in get
    p = _plugins[(name, kind)]
KeyError: ('json-ld', <class 'rdflib.serializer.Serializer'>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/var/task/metadata_extractor/rdfa_extract_poc.py", line 15, in main
    rdfa_json = rdfae.extract(r.text, base_url=None)
  File "/var/task/extruct/rdfa.py", line 35, in extract
    return self.extract_items(tree, base_url=base_url, expanded=expanded)
  File "/var/task/extruct/rdfa.py", line 48, in extract_items
    jsonld_string = g.serialize(format='json-ld', auto_compact=not expanded).decode('utf-8')
  File "/var/task/rdflib/graph.py", line 940, in serialize
    serializer = plugin.get(format, Serializer)(self)
  File "/var/task/rdflib/plugin.py", line 103, in get
    "No plugin registered for (%s, %s)" % (name, kind))
rdflib.plugin.PluginException: No plugin registered for (json-ld, <class 'rdflib.serializer.Serializer'>)

I have been scratching my head over this but can't figure this one out. What should I try? Thanks in advance
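The traceback above boils down to a dictionary lookup keyed on (name, kind) that finds nothing. The following is a toy model of that lookup, written only to illustrate the failure mode; it is not rdflib's actual code, and every name in it is a stand-in.

```python
class PluginException(Exception):
    """Stand-in for rdflib.plugin.PluginException."""


class Serializer:
    """Stand-in for a plugin kind such as rdflib.serializer.Serializer."""


# Registry keyed on (name, kind), as the KeyError in the traceback suggests.
_plugins = {}


def register(name, kind, module_path, class_name):
    """Record where the implementation of a plugin can be found."""
    _plugins[(name, kind)] = (module_path, class_name)


def get(name, kind):
    """Look a plugin up; an unregistered key surfaces as PluginException."""
    try:
        return _plugins[(name, kind)]
    except KeyError:
        raise PluginException(
            "No plugin registered for (%s, %s)" % (name, kind))
```

In this model, get("json-ld", Serializer) fails until something calls register for that key. Whatever normally performs that registration does not appear to have run on Lambda.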

lopuhin commented 5 years ago

Hi @kamilliano, thanks for providing a detailed bug report. My guess is that the json-ld plugin is not being registered on Lambda as it should be, see https://github.com/RDFLib/rdflib-jsonld#using-the-plug-in-jsonld-serializerparser-with-rdflib . Registration likely happens through this bit of code https://github.com/RDFLib/rdflib-jsonld/blob/070d45cad067276e72df5d8f362aee65c158df40/setup.py#L106-L113 , which relies on this machinery https://setuptools.readthedocs.io/en/latest/setuptools.html#dynamic-discovery-of-services-and-plugins , so there must be a way to register it manually.
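One way to test this guess is to list the entry points the running interpreter can actually see. The sketch below uses the standard-library importlib.metadata module (Python 3.8+), which is an assumption about the runtime; with older setups, pkg_resources.iter_entry_points plays the same role. The group name "rdf.plugins.serializer" is also an assumption, based on how rdflib-jsonld's setup.py appears to declare its plugins, and list_entry_points is a hypothetical helper name.

```python
from importlib import metadata


def list_entry_points(group):
    """Return the sorted names of all entry points declared in `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict of group -> entry points.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


# Assumed group name; run this both locally and inside the Lambda handler.
print(list_entry_points("rdf.plugins.serializer"))
```

If "json-ld" shows up locally but not on Lambda, that would point at the deployment artifact rather than at extruct or rdflib.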

lopuhin commented 5 years ago

I think it's possible to register plugins directly via https://github.com/RDFLib/rdflib/blob/1503dae6049f0d9b14d9d7f884d8de4cb38b39a3/rdflib/plugin.py#L88 , like this https://github.com/RDFLib/rdflib/blob/1503dae6049f0d9b14d9d7f884d8de4cb38b39a3/rdflib/plugin.py#L164-L166 - do you think you can try this @kamilliano ?

kamilliano commented 5 years ago

@lopuhin Thanks for quick response. I will have a stab at this.

kamilliano commented 5 years ago

@lopuhin I tried registering the 'json-ld' plugin directly, and that works. Thanks for that, I owe you a beer. I just didn't fully understand the internals. I think my problem is more on my side and has to do with the project structure and how imports are managed in some places (I haven't written it all myself, so I am not taking the blame for all this time ;) ). I tried comparing globals and locals between my local environment and Lambda; everything seems to be there, but I can't fully understand how the pipes fit together at runtime...

lopuhin commented 5 years ago

@kamilliano great that it worked, thanks for digging into it! If possible, can you share the bit of code to register the plugins, so that if someone else has the same problem, they can find your solution?

> I think my problem is more from my side and to do with the project structure and how imports are managed in some places

I'm not sure, but I don't think the issue is with your project code: the plugin should be registered automatically, without any extra imports or anything else. I suspect this could be AWS Lambda's fault (in the way it handles Python package installation), or the fault of the tool that packages everything for Lambda; I'm not quite sure how that works. To check this properly, one would probably need to create a very small, simple project that demonstrates the problem.
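A possible starting point for such a check, as a sketch only: entry points are read from each package's installed metadata, so if the packaging tool copied the rdflib_jsonld module but left out its metadata directory, the import could succeed while the plugin stays unregistered. The helper name installed_version is hypothetical; the distribution names come from the pipenv graph above, and importlib.metadata requires Python 3.8+.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the version recorded in a distribution's metadata, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The code may still be importable, but its metadata is not visible.
        return None


# Compare this output between the local environment and Lambda.
for name in ("rdflib", "rdflib-jsonld"):
    print(name, installed_version(name))
```

A result of None for rdflib-jsonld on Lambda, alongside a successful import of the module, would support the packaging explanation.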

kamilliano commented 5 years ago

@lopuhin the following is the registration that I used in the module calling the main function above:

from rdflib.plugin import register
from rdflib.serializer import Serializer

register(
    'json-ld', Serializer,
    'rdflib_jsonld.serializer', 'JsonLDSerializer')

Gallaecio commented 5 years ago

Should we cover this in the documentation and consider the issue fixed once that is done? There does not seem to be much else we can do.

Gallaecio commented 4 years ago

I’ve just tried to use extruct in Amazon Lambda myself.

Because extruct depends on lxml, you need a Docker image. From then on, it should all work as on a regular Linux machine, so I don’t think there’s anything we need to cover in the documentation, unless there is enough interest to cover this in a FAQ section.