openaustralia / yinyo

A wonderfully simple API-driven service to reliably execute many long-running scrapers in a super scalable way
https://yinyo.io
Apache License 2.0

Scraper output appears to be out of order #73

Closed jamezpolley closed 4 years ago

jamezpolley commented 4 years ago

Reproduction: Run the https://github.com/planningalerts-scrapers/multiple_icon scraper (complete output from a run is attached below).

Expected outcome: Lines written to stdout and stderr appear in the order the scraper emitted them, with the lines of an error message kept together.

Observed outcome: The lines of an error message written to stderr are interspersed line-by-line with regular output from another scraper.

I've attached complete output from a run of the scraper. In particular, the problem was noticeable here:

Collecting feed data for leichhardt...
Storing D/2019/448 - Multiple Occupancy 2-4 Johnston Street ANNANDALE NSW 2038
Storing D/2019/449 - 5 Colgate Avenue BALMAIN NSW 2041
Storing T/2019/156 - 27 Halloran Street LILYFIELD NSW 2040
Storing T/2019/157 - 78 Booth Street ANNANDALE NSW 2038
Storing BC/2019/88 - 93 Short Street BIRCHGROVE NSW 2041
Storing CCP/2019/318 - 75 Phillip Street BIRCHGROVE NSW 2041
Storing D/2019/450 - 14 Campbell Avenue LILYFIELD NSW 2040
Storing D/2019/451 - 133 View Street ANNANDALE NSW 2038
Storing D/2019/452 - 25 Nelson Street ANNANDALE NSW 2038
Storing M/2019/196 - 2 Lonsdale Street LILYFIELD NSW 2040
Storing OC/2019/336 - 27 Reuss Street BIRCHGROVE NSW 2041
Storing OCP/2019/333 - 127 Young Street ANNANDALE NSW 2038
Storing OCP/2019/334 - 49 Annesley Street LEICHHARDT NSW 2040
Storing OCP/2019/335 - 50 Percival Street LILYFIELD NSW 2040
Storing PCAP/2019/421 - 75 Phillip Street BIRCHGROVE NSW 2041
Storing S37P/2019/11 - 10 Loughlin Street ROZELLE NSW 2039
Storing CCP/2019/319 - 6 Booth Street BALMAIN NSW 2041
Storing CDCP/2019/173 - 16 Athol Street LEICHHARDT NSW 2040
Storing D/2019/453 - 23 William Street BALMAIN EAST NSW 2041
Storing M/2019/197 - 1 River Street BIRCHGROVE NSW 2041
randwick: ERROR: execution expired
Storing OCP/2019/337 - 50 Lamb Street LILYFIELD NSW 2040
/app/vendor/ruby-2.5.1/lib/ruby/2.5.0/net/http.rb:937:in `initialize'
Storing PREDA/2019/189 - 41 Wellington Street ROZELLE NSW 2039
/app/vendor/ruby-2.5.1/lib/ruby/2.5.0/net/http.rb:937:in `open'
Storing R/2019/100 - 161 Norton Street LEICHHARDT NSW 2040
/app/vendor/ruby-2.5.1/lib/ruby/2.5.0/net/http.rb:937:in `block in connect'
Storing SC/2019/21 - 25 Coleridge Street LEICHHARDT NSW 2040
/app/vendor/ruby-2.5.1/lib/ruby/2.5.0/timeout.rb:103:in `timeout'
Storing CCP/2019/320 - 172 Evans Street ROZELLE NSW 2039
/app/vendor/ruby-2.5.1/lib/ruby/2.5.0/net/http.rb:935:in `connect'
Storing CCP/2019/321 - Aldersgate Nursing Home 16 Fredbert Street LILYFIELD NSW 2040
/app/vendor/ruby-2.5.1/lib/ruby/2.5.0/net/http.rb:920:in `do_start'
Storing CDCP/2019/174 - 97 Edith Street LEICHHARDT NSW 2040
/app/vendor/ruby-2.5.1/lib/ruby/2.5.0/net/http.rb:915:in `start'
Storing CDCP/2019/175 - 45 Burfitt Street LEICHHARDT NSW 2040

I note that https://github.com/planningalerts-scrapers/icon_scraper/issues/4 seems to be similar, but in that case all the lines of the error message do appear together; they just appear in the middle of the regular output from another scraper. Here, the error lines seem to be interspersed line-by-line with regular output from another scraper.

multiple_icon_output.txt

mlandauer commented 4 years ago

It looks to me like the stdout output is in the correct order and the stderr output is in the correct order, but the two streams are being interleaved in the wrong way.

@jamezpolley any chance that you could write a super simple test example in python (python is quicker to build with buildstep than ruby) that reproduces this problem? We could then add that to the test suite.

mlandauer commented 4 years ago

Thanks @jamezpolley for catching this bug!

mlandauer commented 4 years ago

I've made no progress on solving this today. Very annoying.

mlandauer commented 4 years ago

Well, this is a surprisingly tricky problem. I've spent probably a day all up working on this so far, without a clear solution but with a clear new line of enquiry. I'll do my best to summarise what I've learned so far:

Standard output and standard error are separate pipes, which fundamentally act in parallel

So, it's just not possible to ensure a consistent ordering between the two streams. However, inside each stream the ordering should be consistent.

We can see this behaviour with this very simple Python test case:

from __future__ import print_function

import sys

# This tests that stdout and stderr are sent in the correct order

print("Line 1 (to stderr)", file=sys.stderr)
print("Line 2 (to stderr)", file=sys.stderr)

print("Line 3 (to stdout)")
print("Line 4 (to stdout)")

When you run this scraper you get the results in all kinds of different orders:
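
For example, two runs might come out like this (illustrative only, not actual captured output):

Line 1 (to stderr)
Line 3 (to stdout)
Line 2 (to stderr)
Line 4 (to stdout)

and then on another run:

Line 3 (to stdout)
Line 4 (to stdout)
Line 1 (to stderr)
Line 2 (to stderr)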

Note that the order inside of each stream is preserved but the order between the two streams appears random.

Now if you put in a slight pause:

from __future__ import print_function

import sys
import time

# This tests that stdout and stderr are sent in the correct order

print("Line 1 (to stdout)")
print("Line 2 (to stdout)")

time.sleep(0.1)

print("Line 3 (to stderr)", file=sys.stderr)
print("Line 4 (to stderr)", file=sys.stderr)

You get a consistent ordering of 1, 2, 3, 4.

Looking back at the original example of running the scraper https://github.com/planningalerts-scrapers/multiple_icon.git, we see a different behaviour.

If you watch closely while the scraper is running, as the standard output and standard error lines are being interleaved you'll notice that there is a slight pause between each pair of lines (one from stdout and one from stderr). This behaviour cannot be explained by the concurrency issues above, so we do have a different problem on our hands.

The next step is to try to reproduce the problem with a much simpler test scraper. I think this could be challenging.

mlandauer commented 4 years ago

Some progress... I've been able to reproduce the problem with a fairly simple test case:

from __future__ import print_function

import sys
import time

# Write a burst of 30 lines to stderr with no pauses
for line in range(1, 31):
    print("Line", line, "(to stderr)", file=sys.stderr)

# Then write 20 lines to stdout, pausing half a second between each
for line in range(1, 21):
    print("Line", line)
    time.sleep(0.5)

This produces the output (leaving out the build part):

Line 1 (to stderr)
Line 1
Line 2 (to stderr)
Line 3 (to stderr)
Line 4 (to stderr)
Line 5 (to stderr)
Line 6 (to stderr)
Line 7 (to stderr)
Line 8 (to stderr)
Line 9 (to stderr)
Line 10 (to stderr)
Line 11 (to stderr)
Line 12 (to stderr)
Line 2
Line 13 (to stderr)
Line 3
Line 14 (to stderr)
Line 15 (to stderr)
Line 4
Line 16 (to stderr)
Line 5
Line 17 (to stderr)
Line 18 (to stderr)
Line 6
Line 19 (to stderr)
Line 7
Line 20 (to stderr)
Line 21 (to stderr)
Line 8
Line 22 (to stderr)
Line 9
Line 23 (to stderr)
Line 24 (to stderr)
Line 10
Line 25 (to stderr)
Line 26 (to stderr)
Line 11
Line 27 (to stderr)
Line 12
Line 28 (to stderr)
Line 29 (to stderr)
Line 13
Line 30 (to stderr)
Line 14
Line 15
Line 16
Line 17
Line 18
Line 19
Line 20

So, it looks like if you output enough stuff at once to stderr it gets buffered (or something like that) and then only clears some of it when you output something to stdout. This is definitely not the behaviour we expect or want.

mlandauer commented 4 years ago

The first solution to this problem was to create a single place where the HTTP requests to the server are made. Currently the requests are made in two separate goroutines, one for each stream (stdout, stderr). To fix this we basically make a simple background queue with a fixed size: as each bit of text comes in, from either stream, we say "log this bit of text" as quickly as possible and put it on the single queue, which then retains more of the expected ordering.
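
A minimal sketch of the idea in Go (an illustration of the approach only, not the actual yinyo code; logEntry, streamToQueue and queueSize are made-up names, and the real consumer would POST each entry to the server rather than print it):

package main

import (
    "bufio"
    "fmt"
    "io"
    "strings"
    "sync"
)

type logEntry struct {
    stream string // "stdout" or "stderr"
    text   string
}

// Fixed queue size. A send on a full buffered channel blocks, which is the
// "back pressure" on the stream goroutines discussed below.
const queueSize = 100

// streamToQueue reads lines from one stream and puts them on the shared
// queue as quickly as possible.
func streamToQueue(name string, r io.Reader, queue chan<- logEntry, wg *sync.WaitGroup) {
    defer wg.Done()
    scanner := bufio.NewScanner(r)
    for scanner.Scan() {
        queue <- logEntry{stream: name, text: scanner.Text()}
    }
}

func main() {
    queue := make(chan logEntry, queueSize)

    // Stand-ins for the stdout and stderr pipes of a running scraper.
    stdout := strings.NewReader("Line 1\nLine 2\n")
    stderr := strings.NewReader("Error A\nError B\n")

    var wg sync.WaitGroup
    wg.Add(2)
    go streamToQueue("stdout", stdout, queue, &wg)
    go streamToQueue("stderr", stderr, queue, &wg)

    // Close the queue once both stream readers have finished.
    go func() {
        wg.Wait()
        close(queue)
    }()

    // The single consumer: the one place requests to the server would be
    // made. Because there is only one consumer, entries go out in the
    // order they were queued.
    for entry := range queue {
        fmt.Printf("send %s: %s\n", entry.stream, entry.text)
    }
}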

We run into another problem when the queue fills up. We've set things up relatively sensibly to apply "back pressure" on the stream goroutines: when one of them tries to add a log item and the queue is full, it waits. But when that happens we are basically back in the same place as when we didn't have a queue.

It turns out the queue fills up really quickly because the HTTP requests are really slow. There are two potential solutions to this:

  1. Make the HTTP requests not slow
  2. Send multiple events in a single HTTP request

After much toing and froing it looked like attacking (1) head on was the best approach for now. (2) is more of an optimisation, whereas (1) is fixing a problem where an HTTP request that we expect to take a few milliseconds is taking around 500 ms.
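
For comparison, (2) could look something like the sketch below. This is illustrative only and not what was implemented; it reuses the logEntry type from the sketch above, and maxBatch and postBatch are made-up names.

// drainQueueBatched takes whatever is already sitting in the queue, up to
// maxBatch entries, and sends the whole batch in a single request.
func drainQueueBatched(queue <-chan logEntry, postBatch func([]logEntry) error, maxBatch int) {
    for entry := range queue {
        batch := []logEntry{entry}
        // Grab anything else that is already waiting, without blocking.
    collect:
        for len(batch) < maxBatch {
            select {
            case next, ok := <-queue:
                if !ok {
                    break collect
                }
                batch = append(batch, next)
            default:
                break collect
            }
        }
        if err := postBatch(batch); err != nil {
            // The real service would need to retry or report this error.
            _ = err
        }
    }
}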

mlandauer commented 4 years ago

This was actually fixed by the same PR that fixed #116.