itamarst / eliot

Eliot: the logging system that tells you *why* it happened
https://eliot.readthedocs.io
Apache License 2.0

How to use Eliot with scrapy? #439

Closed laurentS closed 4 years ago

laurentS commented 4 years ago

First of all, thanks for the great library. I haven't gotten it to work yet, but I'm already impressed :)

I am writing a fairly complex scraper with scrapy that fetches files in a tree-like way: an initial index file yields a number of other files to fetch, each of which has several sections to process independently. The processing is complex enough that I am struggling to track errors, and eliot looks like a super promising solution (my first attempt was very exciting to look at, although it doesn't quite work yet).

In short, scrapy is built on top of twisted, but I obviously don't want to modify scrapy's code as described in the docs. To make things worse, the initial scrapy process uses generators everywhere, so keeping track of the current action is tricky.

scrapy uses the concept of an item that it passes around between callbacks to transfer data. There is an initial scraping process which yields requests to which this item is attached. Then scrapy passes this item to a series of pipelines, each of which gets the item, modifies it and returns it.

To keep the context in Eliot, I tried serializing the action with the_id = action.serialize_task_id() and then picking the context back up using with Action.continue_task(task_id=the_id):. It works partially: the first time I call continue_task, the logs look fine, but if I try to do it more than once, they look like this:

83dc01af-a089-4e2e-8b70-8a690363712c
└── parse_report/1 ⇒ started 2019-11-28 14:51:52 ⧖ 0.146s
    ├── eliot:stdlib/2 2019-11-28 14:51:52
    │   ├── log_level: INFO
    │   ├── logger: spider
    │   └── message: Processing XXXX
    ├── HI/3 2019-11-28 14:51:52
    ├── eliot:remote_task/4/1 ⇒ started 2019-11-28 14:51:52 ⧖ 3.789s
    │   ├── parse_item/4/2/1 ⇒ started 2019-11-28 14:51:52 ⧖ 3.787s
    │   │   ├── eliot:stdlib/4/2/2 2019-11-28 14:51:52
    │   │   │   ├── log_level: INFO
    │   │   │   ├── logger: pipeline
    │   │   │   └── message: Parsing data for item1
    │   │   ├── eliot:stdlib/4/2/3 2019-11-28 14:51:53
    │   │   │   ├── log_level: INFO
    │   │   │   ├── logger: pipeline
    │   │   │   └── message: Error parsing item1
    │   │   ├── eliot:stdlib/4/2/4 2019-11-28 14:51:53
    │   │   │   ├── log_level: INFO
    │   │   │   ├── logger: pipeline
    │   │   │   └── message: Parsing data for item2
    │   │   ├── eliot:stdlib/4/2/5 2019-11-28 14:51:56
    │   │   │   ├── log_level: INFO
    │   │   │   ├── logger: pipeline
    │   │   │   └── message: Parser completed item2
    │   │   └── parse_item/4/2/6 ⇒ succeeded 2019-11-28 14:51:56
    │   └── eliot:remote_task/4/3 ⇒ succeeded 2019-11-28 14:51:56
    └── parse_report/5 ⇒ succeeded 2019-11-28 14:51:52

...more stuff...

83dc01af-a089-4e2e-8b70-8a690363712c   <--- this ID matches the one above
└── <unnamed>   <-- but from here, the context seems to be lost
    └── eliot:remote_task/4/1 ⇒ started 2019-11-28 14:51:56 ⧖ 0.038s
        ├── eliot:stdlib/4/2 2019-11-28 14:51:56
        │   ├── log_level: INFO
        │   ├── logger: pipeline
        │   └── message: Saving item1 to DB
        ├── eliot:stdlib/4/3 2019-11-28 14:51:56
        │   ├── log_level: INFO
        │   ├── logger: pipeline
        │   └── message: Saving item2 to DB
        └── eliot:remote_task/4/4 ⇒ succeeded 2019-11-28 14:51:56

The code looks like this (these are the standard scrapy callbacks):

from eliot import Action, start_action

def parse_report(self, response):  # standard scrapy callback
    with start_action(action_type="parse_report") as action:
        do_stuff()
        # Attach the serialized task ID to the item so pipelines can pick it up.
        item['task_id'] = action.serialize_task_id()
        return item

class Pipeline1:
    def process_item(self, item, spider):
        # Resume the action started in parse_report.
        with Action.continue_task(task_id=item['task_id']):
            do_the_parsing_stuff()
        return item

class Pipeline2:
    # essentially the same as Pipeline1 but saving to DB
Is this kind of pipeline logic supported by continue_task, or am I trying to use the wrong solution here? To be clear, each pipeline's process_item() is called once per item, and I then loop over each subitem inside each pipeline. Ideally, I want the logs to reflect that tree structure, to make tracing errors easier. Any ideas would be great!

itamarst commented 4 years ago

Unfortunately you can't add action context from inside individual callbacks... You need to start actions outside, but then as you say DeferredContext is needed to track things.

Once Twisted has native support for contextvars (https://github.com/twisted/twisted/pull/1192), the use of eliot.twisted.DeferredContext will in many cases become unnecessary. At that point any wrapped Twisted code should work better; you won't need to mess with the internals.

As a stopgap measure you can:

  1. Use DeferredContext at the entry point to a scrapy Deferred-based API (sketched below). This will at least give you some action context.
  2. Redirect scrapy's logs to Eliot (https://eliot.readthedocs.io/en/stable/generating/twisted.html if they use Twisted logging, https://eliot.readthedocs.io/en/stable/generating/migrating.html if they use stdlib's logging package). Looks like you're already doing this.
  3. Just have normal actions inside your callbacks.

Not ideal, but something.
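
For what it's worth, here's a rough, untested sketch of how 1-3 might fit together when running scrapy from a script. The spider, URL, and log file name (MySpider, example.com, scraper.log) are made up for illustration; CrawlerRunner is scrapy's run-from-a-script API, whose crawl() returns a Deferred you can wrap.

import logging

from scrapy import Spider
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor

from eliot import start_action, to_file
from eliot.stdlib import EliotHandler
from eliot.twisted import DeferredContext

# Send Eliot's output to a file (the name is arbitrary).
to_file(open("scraper.log", "w"))

# (2) Route stdlib logging (which scrapy uses) into Eliot.
logging.getLogger().addHandler(EliotHandler())


class MySpider(Spider):
    # Hypothetical spider, just to have something for crawl() to run.
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # (3) Normal actions inside callbacks: these still give useful,
        # if flatter, structure even without full context propagation.
        with start_action(action_type="parse", url=response.url):
            yield {"title": response.css("title::text").get()}


# (1) Wrap the Deferred returned by the entry point in DeferredContext,
# inside a started action, so callbacks added to it run in that context.
with start_action(action_type="crawl"):
    runner = CrawlerRunner()
    d = DeferredContext(runner.crawl(MySpider))
    # Finish the "crawl" action when the crawl ends, then stop the reactor.
    d.addActionFinish().addBoth(lambda _: reactor.stop())

reactor.run()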

itamarst commented 4 years ago

I guess, looking at what you're doing in more detail, that might also be a reasonable stopgap, with one caveat: you can only call continue_task once per serialized ID. Serialized task IDs are single use.
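
One way to work within that constraint, as a rough sketch reusing the pipeline shape from the question (do_the_parsing_stuff is the question's placeholder), is to have each stage serialize a fresh ID before handing the item to the next one:

from eliot import Action

class Pipeline1:
    def process_item(self, item, spider):
        # Resume the task using the single-use ID attached upstream...
        with Action.continue_task(task_id=item['task_id']) as action:
            do_the_parsing_stuff()  # placeholder from the question
            # ...then serialize a fresh ID so the next pipeline can
            # continue the same task without reusing this one.
            item['task_id'] = action.serialize_task_id()
        return item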

laurentS commented 4 years ago

Thanks for the fast reply! This was super helpful. I've tried a few different options, and so far, my understanding is:

So it's not perfect: I still have weird non-chronological output in some places, and it requires thinking carefully about where the context manager and serialization calls happen, but the result is already way more readable than what I had only 3 hours ago :tada: Thanks for your help! (and the cool logging library)