mozilla / coverage-crawler

A crawler to find websites that exercise code in Firefox that is not covered by unit tests
Mozilla Public License 2.0

Optimize crawling performance #151

Open marco-c opened 6 years ago

marco-c commented 6 years ago

Right now the crawler is quite slow; I think the slowest part is finding all the elements. Perhaps we should apply a greedy approach instead and just click on the first available element.

MadinaB commented 6 years ago

Are you talking about the `run_in_driver()` method in `crawler.py`? It generates a sequence with `sequence = run_in_driver(website, driver)` and then, for each element in the sequence, does the following:

for element in sequence:
    f.write(json.dumps(element) + '\n')

I think this part could be handed off to an execution pool, since each iteration does not depend on the outcome of any other: it simply runs a method and writes its output to a separate file.

for website in websites:
    data_folder = str(uuid.uuid4())
    os.makedirs(data_folder, exist_ok=True)
    try:
        sequence = run_in_driver(website, driver)
        with open('{}/steps.txt'.format(data_folder), 'w') as f:
            f.write('Website name: ' + website + '\n')
            for element in sequence:
                f.write(json.dumps(element) + '\n')
    except:  # noqa: E722
        traceback.print_exc(file=sys.stderr)
        close_all_windows_except_first(driver)

I think an execution pool with async behavior would be a good fit here. I will profile the crawler with cProfile before and after the change to see whether performance actually improves, and report back.
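A minimal sketch of that idea, assuming a `ThreadPoolExecutor` over the per-website loop. The `run_in_driver` below is a placeholder stub, not the crawler's real Selenium-backed function; in the real code each worker would need its own WebDriver instance, since a WebDriver is not safe to share across threads:

```python
import json
import os
import sys
import traceback
import uuid
from concurrent.futures import ThreadPoolExecutor


def run_in_driver(website):
    # Placeholder: the real crawler drives Firefox via Selenium here
    # and returns the sequence of interactions it performed.
    return [{'website': website, 'action': 'click'}]


def crawl_one(website, base_dir='.'):
    # Same per-website logic as the original loop, factored out so a
    # pool can run many websites concurrently.
    data_folder = os.path.join(base_dir, str(uuid.uuid4()))
    os.makedirs(data_folder, exist_ok=True)
    try:
        sequence = run_in_driver(website)
        with open(os.path.join(data_folder, 'steps.txt'), 'w') as f:
            f.write('Website name: ' + website + '\n')
            for element in sequence:
                f.write(json.dumps(element) + '\n')
    except Exception:
        traceback.print_exc(file=sys.stderr)
    return data_folder


def crawl_all(websites, max_workers=4):
    # Each website is independent, so pool.map preserves the simple
    # one-folder-per-website output layout while overlapping the work.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(crawl_one, websites))
```

Whether threads or processes win here depends on how much time is spent blocked on the browser versus in Python; that is exactly what the cProfile comparison should show.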

rhcu commented 6 years ago

@MadinaB Just a suggestion: one of the slowest parts of the crawler, after downloading artifacts and interacting with elements, is diffing between two coverage reports. This could be made faster by fixing https://github.com/mozilla/grcov/issues/77, since doing the diff in Rust should be faster. Another slow part is converting 'output.json' to HTML: right now the coveralls output is converted to lcov, and then lcov is converted to HTML with 'genhtml'. This could also be sped up by solving another grcov issue: https://github.com/mozilla/grcov/issues/94
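For context, the diff itself is conceptually simple. Here is a hedged Python sketch assuming coveralls-style JSON input (`{"source_files": [{"name": ..., "coverage": [...]}]}`, where `coverage[i]` is the hit count for line `i + 1` and `None` marks non-executable lines); `diff_coverage` is an illustrative name, not the crawler's actual function:

```python
def diff_coverage(before, after):
    # Set of (file, line-index) pairs already hit in the baseline report.
    hits_before = {
        (sf['name'], i)
        for sf in before['source_files']
        for i, count in enumerate(sf['coverage'])
        if count  # truthy hit count; None/0 means not covered
    }

    # Collect lines newly covered in the second report, keyed by file,
    # reported as 1-based line numbers.
    new_lines = {}
    for sf in after['source_files']:
        for i, count in enumerate(sf['coverage']):
            if count and (sf['name'], i) not in hits_before:
                new_lines.setdefault(sf['name'], []).append(i + 1)
    return new_lines
```

Even in Python this is linear in report size; the win from doing it inside grcov in Rust would come from avoiding the JSON round-trip and the interpreter overhead on Firefox-sized reports.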

marco-c commented 6 years ago

@MadinaB no, run_in_driver should be fine in terms of performance. I was talking about the way we select the next element to interact with: https://github.com/mozilla/coverage-crawler/blob/de41978840a025db07aca5434c57c767e7b05fc4/coverage_crawler/crawler.py#L102.

Finding all the elements is slow.
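A hedged sketch of the greedy idea: consume candidate elements lazily and stop at the first one that passes an interactability check, instead of materializing the full element list up front. `first_interactable` and `is_interactable` are illustrative names, not the crawler's API; with Selenium the candidates would come from the driver:

```python
def first_interactable(candidates, is_interactable):
    # Greedy selection: return the first element that passes the check.
    # Because `candidates` is consumed as an iterator, elements after
    # the first hit are never fetched or checked at all.
    for element in candidates:
        if is_interactable(element):
            return element
    return None
```

The test below shows the point: with candidates 1..9 and a check that accepts multiples of 3, only the first three candidates are ever examined.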