modernmt / DataCollection

Data collection, alignment and TAUS repository
Apache License 2.0

Script candidates2corpus.py needs days to run for large language pairs #6

Open achimr opened 7 years ago

achimr commented 7 years ago

For large language pairs with about 1.2 million candidate pairs this script takes days to run. While in this case 2.4 million web pages get downloaded and processed, it would still be useful to determine where the bottleneck lies:

  1. the downloading
  2. the extraction of the candidate text from HTML
  3. the text processing (including the external text processor)
  4. the saving of the text in BASE64 encoding

Example command line:

nohup cat candidates.en-es.locations | ~/DataCollection/baseline/candidates2corpus.py -source_splitter='/scripts/ems/support/split-sentences.perl -l en -b -q' -target_splitter='/scripts/ems/support/split-sentences.perl -l es -b -q'  2> candidates2corpus.log > en-es.down &

Profile code with 10s to 100s of candidate pairs.
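
For reference, a minimal sketch of how such a profiling run could be set up with cProfile (the process_candidates helper is hypothetical - the real entry point inside candidates2corpus.py may be structured differently; candidates.en-es.locations is the input file from the example above):

import cProfile
import pstats

def process_candidates(lines):
    # hypothetical stand-in for the download/extract/process/BASE64 pipeline
    pass

with open("candidates.en-es.locations") as f:
    # roughly 100 candidate pairs; each record spans two input lines
    sample = [line for _, line in zip(range(200), f)]

cProfile.run("process_candidates(sample)", "candidates2corpus.prof")
pstats.Stats("candidates2corpus.prof").sort_stats("cumulative").print_stats(100)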

achimr commented 7 years ago

Ran the Python profiler cProfile on the first 100 candidates from the 2015_32 en_es data collection. These are the percentages of cumulative time spent in each of the steps listed above:

  1. 54% for downloading
  2. 33% extraction of candidate text from HTML
  3. 10% text processing/tokenization
  4. <1% saving the text in BASE64 - doesn't register in the top-100 routines sorted by time

This was run in the AWS us-east-1 region where the CommonCrawl data is located as well.

So downloading the content does take the majority of the time; however, about 44% is spent extracting the text from HTML and tokenizing it. Some avenues to investigate:

  1. parallelizing the script, e.g. with GNU parallel
  2. separating the downloading from the extraction/processing so that each can be parallelized and optimized on its own
  3. downloading the HTML from the meta-data service instead

achimr commented 7 years ago

After some investigation: running the code in parallel with GNU parallel doesn't work because the input file has records (page pairs) that span two lines - these can get separated across different input blocks by GNU parallel, and the data from the separated records can then not be downloaded/aligned. So separating the downloading from the extraction/processing and making both steps parallelizable seems to be the best avenue (a rough sketch follows below). This would also separate the network-bound download from the CPU-bound processing, so each can be optimized separately.

BTW - the last avenue described above, downloading the HTML from the meta-data service, is not advisable, as it would make the downloading of parallel corpora dependent on the availability of the meta-data service.
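
A rough sketch, not the repository's actual code, of how the two-line page-pair records could be read intact and the work split into a separate download phase and extraction/processing phase (the read_page_pairs, download_phase and process_phase names are illustrative only):

import sys

def read_page_pairs(stream):
    # each candidate record (page pair) spans two consecutive input lines;
    # reading them together keeps records from being split apart
    while True:
        first = stream.readline()
        second = stream.readline()
        if not first or not second:
            break
        yield first.rstrip("\n"), second.rstrip("\n")

def download_phase(pairs, out):
    # network-bound: fetch both pages of each pair and write the raw
    # (e.g. BASE64-encoded) HTML to an intermediate file, one record per line
    for source_line, target_line in pairs:
        pass  # placeholder

def process_phase(raw_records):
    # CPU-bound: HTML text extraction, sentence splitting, tokenization
    pass  # placeholder

if __name__ == "__main__":
    download_phase(read_page_pairs(sys.stdin), sys.stdout)

Writing the intermediate file with one complete record per line would also make it safe to feed into GNU parallel for the processing phase.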

achimr commented 7 years ago

It also seems unnecessary to extract text from the HTML in the WARC files, as the plain text is already available in the WET files: http://commoncrawl.org/the-data/get-started/
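
As an illustration, a small sketch of reading the plain text straight from a WET file with the third-party warcio package (warcio is not currently used by this repository, and the file name is just a placeholder):

from warcio.archiveiterator import ArchiveIterator

# placeholder path to a downloaded CommonCrawl WET segment
with open("commoncrawl-segment.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":  # WET plain-text records
            uri = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", "replace")
            print(uri, len(text))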

achimr commented 6 years ago

Investigated options to enable parallel downloading:

  1. Separate downloading from extraction/processing as described above and then parallelize with GNU parallel
  2. Use the aiohttp Python module to download multiple pages concurrently; the actual downloading is embedded deep in ccdownloader.py, so possible side effects need to be checked (see the sketch below)
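
A hedged sketch of option 2 (aiohttp is not currently a dependency of this repository; the URLs and the concurrency limit are illustrative only):

import asyncio
import aiohttp

async def fetch(session, semaphore, url):
    # one bounded GET request; returns the URL and the raw page body
    async with semaphore:
        async with session.get(url) as response:
            return url, await response.read()

async def fetch_all(urls, limit=20):
    # download many pages concurrently on a single event loop
    semaphore = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    urls = ["http://example.com/"]  # illustrative only
    pages = asyncio.run(fetch_all(urls))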