usc-isi-i2 / t2wml

Table to Wikidata Mapping Language
MIT License

Handle very large files #560

Open devowit opened 3 years ago

devowit commented 3 years ago
kyao commented 3 years ago

Use this file for testing: http://databank.worldbank.org/data/download/WDI_excel.zip

kyao commented 3 years ago

Another test dataset. This one is only ~16K rows.

IDP_Time_Series_Data_No_Missing_No_Annotation.zip

devowit commented 3 years ago

@g1eb both files Ke-Thia added are rejected by the frontend as too large.

g1eb commented 3 years ago

@devowit the second file Ke-Thia shared (IDP_Time_Series_Data_No_Missing_No_Annotation.csv) is only 1.1 MB and works fine.

The other file (WDIEXCEL.xlsx) is 70 MB. When I remove the frontend limit, I get an error from the backend; see the screenshot below:

[Screenshot: Screen Shot 2021-09-04 at 5 50 11 PM]

g1eb commented 3 years ago

I've changed the theoretical limit in app_config.py:42 from 16 MB to 100 MB:

MAX_CONTENT_LENGTH = 100 * 1024 * 1024  # 100 MB max file size
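
To make the effect of this limit concrete, here is a minimal sketch that checks a file size against the same value. Only `MAX_CONTENT_LENGTH` comes from the config line above; the helper `fits_upload_limit` is hypothetical and not part of t2wml.

```python
# Same value as the config line above: 100 MB expressed in bytes.
MAX_CONTENT_LENGTH = 100 * 1024 * 1024


def fits_upload_limit(size_bytes, limit=MAX_CONTENT_LENGTH):
    """Return True if a request body of `size_bytes` is within `limit` bytes.

    Hypothetical helper for illustration; not a t2wml function.
    """
    return size_bytes <= limit


# The 70 MB workbook mentioned above fits under the new limit
# but would not have fit under the old 16 MB one.
xlsx_size = 70 * 1024 * 1024
print(fits_upload_limit(xlsx_size))                          # new limit
print(fits_upload_limit(xlsx_size, limit=16 * 1024 * 1024))  # old limit
```

This only covers the backend setting; any limit enforced by the frontend or a proxy in front of the backend applies separately.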

kyao commented 3 years ago

@g1eb Would you please do some profiling to see what file sizes t2wml can handle and how long it takes for the results to return?
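
One way such profiling could be set up is sketched below: generate CSV inputs of increasing size and time a processing function on each. The names `make_csv`, `profile`, and `parse_only` are hypothetical, and `parse_only` is just a stand-in workload; the real measurement would call the actual upload or annotation step instead.

```python
import csv
import io
import time


def make_csv(rows, cols=5):
    """Build an in-memory CSV with a header and `rows` data rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f"col{i}" for i in range(cols)])
    for r in range(rows):
        writer.writerow([r * cols + c for c in range(cols)])
    return buf.getvalue()


def profile(process, row_counts):
    """Time `process(text)` per row count; return (rows, bytes, seconds) tuples."""
    results = []
    for rows in row_counts:
        text = make_csv(rows)
        start = time.perf_counter()
        process(text)
        elapsed = time.perf_counter() - start
        results.append((rows, len(text.encode("utf-8")), elapsed))
    return results


def parse_only(text):
    """Stand-in workload (assumption): just parse the CSV into rows."""
    return list(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    for rows, size, secs in profile(parse_only, [1_000, 10_000, 100_000]):
        print(f"{rows:>7} rows  {size / 1e6:6.2f} MB  {secs:.3f} s")
```

Reporting both row count and byte size makes it easier to see which of the two actually drives the run time.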

kyao commented 3 years ago

I tried to upload this 1.8 MB file and got an error:

t2wml-web        | 2021/09/08 05:58:04 [error] 24#24: *42 client intended to send too large body: 1848588 bytes, client: 172.18.0.1, server: localhost, request: "POST /api/causx/upload/data HTTP/1.1", host: "localhost:8080", referrer: "http://localhost:8080/"

WGI_Data.zip

g1eb commented 3 years ago

Ahh, there was one more filter blocking large files: nginx. I've fixed that now.

Here's the nginx setting in case causx needs to set the same setting on their end: https://github.com/usc-isi-i2/t2wml-web/commit/d6cfb0127a40c251bd22ffc18503864f829dde6b
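
As a quick sanity check on the error above, the sketch below pulls the rejected body size out of the nginx log line and compares it with nginx's documented default `client_max_body_size` of 1m. Whether the container was actually running with that default is an assumption; the thread does not state the previous value, and the linked commit is not reproduced here. The helper name `rejected_body_size` is hypothetical.

```python
import re

# Relevant part of the nginx error line quoted earlier in the thread.
LOG_LINE = "[error] 24#24: *42 client intended to send too large body: 1848588 bytes"

# nginx's documented default for client_max_body_size is 1m (1 MiB).
# Assumption: the container was still using this default before the fix.
NGINX_DEFAULT_LIMIT = 1 * 1024 * 1024


def rejected_body_size(line):
    """Extract the rejected request body size in bytes, or None if not present."""
    match = re.search(r"too large body: (\d+) bytes", line)
    return int(match.group(1)) if match else None


size = rejected_body_size(LOG_LINE)
print(f"{size} bytes = {size / (1024 * 1024):.2f} MiB, "
      f"over default limit: {size > NGINX_DEFAULT_LIMIT}")
```

Under that assumption, a 1.8 MB upload would be rejected by nginx before it ever reached the backend, which matches the log.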

kyao commented 3 years ago

I am still getting a timeout error, even with the nginx modification.

t2wml-web        | 172.18.0.1 - - [13/Sep/2021:04:03:46 +0000] "GET / HTTP/1.1" 200 3134 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0" "-"
t2wml-web        | 172.18.0.1 - - [13/Sep/2021:04:03:46 +0000] "GET /static/js/main.682e22e7.chunk.js HTTP/1.1" 200 90382 "http://localhost:8080/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0" "-"
t2wml-web        | 172.18.0.1 - - [13/Sep/2021:04:03:46 +0000] "GET /static/js/2.4b359fb7.chunk.js HTTP/1.1" 200 559369 "http://localhost:8080/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0" "-"
t2wml-web        | 172.18.0.1 - - [13/Sep/2021:04:03:46 +0000] "GET /static/js/3.e6a9a2f8.chunk.js HTTP/1.1" 200 4210 "http://localhost:8080/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0" "-"
t2wml-backend    | 172.18.0.3 - - [13/Sep/2021 04:03:46] "GET /api/causx/token HTTP/1.0" 200 -
t2wml-web        | 172.18.0.1 - - [13/Sep/2021:04:03:46 +0000] "GET /api/causx/token HTTP/1.1" 200 178 "http://localhost:8080/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0" "-"
t2wml-web        | 2021/09/13 04:18:28 [warn] 26#26: *5 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000000001, client: 172.18.0.1, server: localhost, request: "POST /api/causx/upload/data HTTP/1.1", host: "localhost:8080", referrer: "http://localhost:8080/"
t2wml-web        | 2021/09/13 04:23:28 [error] 26#26: *5 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.18.0.1, server: localhost, request: "POST /api/causx/upload/data HTTP/1.1", upstream: "http://172.18.0.2:13000/api/causx/upload/data", host: "localhost:8080", referrer: "http://loc$
t2wml-web        | 172.18.0.1 - - [13/Sep/2021:04:23:28 +0000] "POST /api/causx/upload/data HTTP/1.1" 504 494 "http://localhost:8080/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0" "-"
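
The two nginx lines for the upload (the `[warn]` when the body is buffered and the `[error]` when the upstream times out) can be compared directly. The sketch below computes the gap between them; the timestamps are copied from the log, while the helper `seconds_between` is hypothetical.

```python
from datetime import datetime

# Timestamps copied from the two nginx lines in the log above.
BUFFERED_AT = "2021/09/13 04:18:28"   # request body buffered to a temporary file
TIMED_OUT_AT = "2021/09/13 04:23:28"  # upstream timed out


def seconds_between(start, end, fmt="%Y/%m/%d %H:%M:%S"):
    """Return the whole number of seconds between two nginx error-log timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


print(seconds_between(BUFFERED_AT, TIMED_OUT_AT))  # 300
```

A gap of exactly 300 seconds suggests a five-minute proxy timeout was reached while the backend was still processing, although the nginx configuration itself is not shown in this thread.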

g1eb commented 3 years ago

Which file is this, Ke-Thia? I had to raise the limit again, to around 500 MB. Let's try again with the updated backend and frontend images in place.

kyao commented 3 years ago

It's WDI. I will try again.

kyao commented 3 years ago

I was uploading the CSV version, which is bigger than 100 MB.

g1eb commented 3 years ago

Right, it should work now that the limit is 500 MB, though it would still take a long time. I would wait until we publish the paginated version before uploading that.

kyao commented 3 years ago

@devowit I tried suggest annotations on the WDI dataset. It took 33 minutes on my machine and returned a 2 GB JSON. Most of the JSON is error messages, which the web frontend ignores. How about adding an option that suppresses error messages?
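
A rough sketch of what such an option could do is shown below. The response shape (a `layers` dict with `annotation` and `error` entries) is an assumption made for illustration, not the actual t2wml format, and `strip_errors` is a hypothetical helper.

```python
import json


def strip_errors(result, error_key="error"):
    """Return a copy of `result` without the layer stored under `error_key`.

    The input dict is not modified.
    """
    layers = result.get("layers", {})
    kept = {name: layer for name, layer in layers.items() if name != error_key}
    return {**result, "layers": kept}


# Hypothetical response shape, for illustration only.
response = {
    "layers": {
        "annotation": [{"role": "mainSubject", "selection": "A2:A10"}],
        "error": [{"cell": "B3", "message": "could not parse"}],
    }
}

slim = strip_errors(response)
print(len(json.dumps(response)), "->", len(json.dumps(slim)), "characters")
```

Dropping the errors on the server side, before serialization, would also avoid the cost of building and transferring them in the first place.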

devowit commented 3 years ago

Currently, suggest annotations returns a full layer result, not just the suggested annotation.

There are two options:

  1. suggest annotations returns the annotation only, not a full layer result
  2. @g1eb sends start and end parameters so that results are only fetched for x number of rows.
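
The second option can be sketched as a simple row-range filter. The entry shape (a dict with a 0-based `row` key) and the function `slice_rows` are hypothetical; the thread does not say how t2wml would interpret `start` and `end`, so a half-open range is assumed here.

```python
def slice_rows(entries, start, end):
    """Keep entries whose 0-based `row` falls in the half-open range [start, end)."""
    return [entry for entry in entries if start <= entry["row"] < end]


# Hypothetical per-cell results, one entry per row.
entries = [{"row": r, "value": f"cell {r}"} for r in range(1000)]

page = slice_rows(entries, start=100, end=150)
print(len(page), page[0]["row"], page[-1]["row"])  # 50 100 149
```

With this kind of filter the response size depends on the requested window rather than on the size of the whole file.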

g1eb commented 3 years ago

We would need the annotations to be present in the layers in order to draw them.

What part of the response depends on the start and end indexes? The annotations returned are based on all rows, regardless of the indexes I provide.