mixpanel / mixpanel-utils


Large export killed quickly #28

Closed zeusstl closed 3 years ago

zeusstl commented 3 years ago

I wrote a similar, less detailed message in the related closed thread: #2

My goal is to export all of the data on a regular basis. I've built my entire process around the CSV this outputs, but when the data is too big, the whole thing fails. I can run this locally on my Mac, but on the Linux server it gets killed.

It runs out of virtual memory at about 100MB into the export, at which point virtual memory usage shoots up very fast. The server has plenty of memory. I'm still using Python 2.7.12 for this, per the README.

```
DEBUG: Request Method: GET
DEBUG: Request URL: https://data.mixpanel.com/api/2.0/export?from_date=2020-10-01&to_date=2021-02-01
DEBUG: Request Headers: {"Accept-encoding": "gzip", "Authorization": "Basic xxxxxxxhidden_for_sharingxxxx"}
```

```
kernel: [2136433.332390] [ pid ]   uid  tgid total_vm     rss nr_ptes nr_pmds swapents oom_score_adj name
kernel: [2136433.332727] [27865]     0 27865  1051926 1040932    2057       6        0             0 python
Out of memory: Kill process 27865 (python) score 495 or sacrifice child
```

  1. @jaredmixpanel suggested trying the "expert box" in issue #2. What is the expert box?

  2. Any recommendations for resolving this issue?

zeusstl commented 3 years ago

Update: I've increased the server memory to 16GB, and it still burns through all of the memory without ever downloading the data. (It actually crashes the server now, probably something to do with the OOM killer settings.)

jaredmixpanel commented 3 years ago

@zeusstl the "expert box" was a shared server here at Mixpanel; issue #2 was posted by a Mixpanel employee.

most likely you need to reduce the date range of your export. Every event for the last 4 months is probably too big of a request.

you could loop through shorter date ranges yourself or use the `request_per_day=True` parameter to make one export (and file) per day, e.g.:
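
here's a rough sketch of the per-day approach, assuming the `Mixpanel` constructor and `export_events` signature shown in this repo's README (the credentials are placeholders):

```python
from mixpanel_api import Mixpanel

# placeholder credentials; substitute your project's API secret and token
m = Mixpanel('API_SECRET', token='PROJECT_TOKEN')

# request_per_day=True issues one export request (and writes one file) per
# day in the range, so no single response has to fit in memory all at once
m.export_events('event_export.txt',
                {'from_date': '2020-10-01', 'to_date': '2021-02-01'},
                format='csv',
                request_per_day=True)
```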

or you could try using the `raw_stream=True` parameter to stream the newline-delimited JSON directly to the output file. note that you won't end up with a single, large, valid JSON object, but instead one valid JSON object per event (on each line).
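
same idea with streaming, a sketch under the same assumptions as above:

```python
from mixpanel_api import Mixpanel

m = Mixpanel('API_SECRET', token='PROJECT_TOKEN')  # placeholder credentials

# raw_stream=True writes the API response straight to disk as
# newline-delimited JSON instead of parsing it all in memory first
m.export_events('event_export.txt',
                {'from_date': '2020-10-01', 'to_date': '2021-02-01'},
                raw_stream=True)
```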

zeusstl commented 3 years ago

Thanks @jaredmixpanel.

My issue with pulling the data in parts is that Mixpanel returns a different set of columns depending on the date range selected. This makes importing the data extremely difficult: rather than a simple LOAD INTO for all columns, we have to modify the data or insert it row by row, which is extremely expensive.

Are you aware of a way to get all possible columns from the api regardless of the date range (with empty fields in those columns)?

jaredmixpanel commented 3 years ago

@zeusstl there are no "columns" in Mixpanel; it's a schemaless/NoSQL/JSON document store, so if a particular property name does not occur in any of the events during a given time range, it's not in the data returned. unfortunately, no, there is no way to force the api to scan your entire event history for every property name that has ever occurred and then insert empty values for the ones that don't appear in each event. this is just a raw data export api.
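
to illustrate (hypothetical events, not from your project), two exported events might look like this; beyond the reserved fields there's no shared schema, so the first has `$browser` and the second has `plan` instead:

```json
{"event": "Sign Up", "properties": {"time": 1610000000, "distinct_id": "user-1", "$browser": "Chrome"}}
{"event": "Purchase", "properties": {"time": 1610000300, "distinct_id": "user-2", "plan": "pro"}}
```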

however, we do have apis for generating schemas from your data: https://developer.mixpanel.com/reference/retrieve-schemas

sounds like you are using `format='csv'`? note that Mixpanel does not return CSV; the mixpanel_api python module does the transform from JSON to CSV on your machine, which may well be the part that's causing the OOM. you might want to try `format='json'` and then do the transform to a columnar format yourself, in a more memory-efficient process. if the JSON export still gets killed, you can try `raw_stream=True`.
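
one memory-friendly way to do that transform yourself, sketched here assuming a newline-delimited JSON export file like the one `raw_stream=True` produces (Python 3 shown for clarity): stream the file twice, first to collect the union of property names, then to write the CSV rows, so every export chunk ends up with the same columns:

```python
import csv
import json

# pass 1: stream the export once to collect every property name that occurs
columns = set()
with open('event_export.txt') as f:
    for line in f:
        columns.update(json.loads(line)['properties'])
fieldnames = ['event'] + sorted(columns)

# pass 2: stream again, writing one CSV row per event; DictWriter's restval
# fills properties missing from an event with empty fields
with open('event_export.txt') as f, \
     open('event_export.csv', 'w', newline='') as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for line in f:
        event = json.loads(line)
        row = dict(event['properties'])
        row['event'] = event['event']
        writer.writerow(row)
```

if you run the pass-1 step over your full history once and reuse that fixed column list for every incremental export, each chunk gets identical columns (with empty fields where a property never occurred), which addresses the "all possible columns" question on your side of the api.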