metabase / metabase

The simplest, fastest way to get business intelligence and analytics to everyone in your company :yum:
https://metabase.com
Other
37.31k stars 4.95k forks source link

Option for downloading CSV with no rows limit #3810

Closed ant1j closed 7 years ago

ant1j commented 7 years ago

It would be great to facilitate data extraction through Metabase.

A few potential enhancements to do so:

ibushong commented 7 years ago

Not being able to download all rows is a pretty big issue for us. Having something like a "download all" button would be great.

axb21 commented 7 years ago

I'd also like an option to download CSV files with at least a higher row limit than the current 10,000. I can see the rationale for the limit, but I am exploring using metabase as a way to deliver insights to non-technical people in my company, and every now and then they'll generate a result set with slightly more than 10,000 rows. The most recent one was 13,000 rows or so, and it was frustrating that they could not hit a button and download the whole thing. Just a little too big.

I am not familiar at all with the metabase codebase, having literally just looked at it minutes ago. However, I hacked a bit and found that changing one line in src/metabase/api/dataset.clj (as of commit 4844ecc17dd7110662005055ea07709495ad46a9) sufficed for my purposes, at least as far as I'm able to tell so far. I have attached a patch if you're curious. The runnable jar file is easy to build following the developer notes and so far it's running fine in my environment and people are happily downloading CSV files with >10k rows. Again, I have no clue whether this will wreak havoc on the rest of the codebase, so word of warning.

up-download-row-limit..txt

hugocar commented 7 years ago

The fact the limit wasn't enforced in version 0.20 is causing troubles for me. Users, where I work, were beginning to get used to export big data sets and are seeing this limit as a bug/regression. I'm not sure what to tell them.

tlrobinson commented 7 years ago

Confirmed this is a regression in v0.21.0. Downloads of the Orders table in the sample dataset are limited to 2,000 rows even though the copy in the UI says "The maximum download size is 1 million rows.". In v0.20.3 I get all 17,624 rows.

salsakran commented 7 years ago

@tlrobinson I think this is happening on the frontend.

I'm fairly certain the query dictionary being serialized for POST params in https://github.com/metabase/metabase/blob/master/frontend/src/metabase/query_builder/components/DownloadWidget.jsx#L32 includes the limits.

tlrobinson commented 7 years ago

@salsakran I think you're right for unsaved questions. It looks like we started using json_query from the results, which includes the constraints object. If we strip out that object it downloads all rows. PR at #4037.

However, saved questions are still limited to 2k rows by the backend (the frontend doesn't send the query at all in that case).

axb21 commented 7 years ago

@tlrobinson I think you're right about that. If you have a look at the patch I posted earlier, you'll see I changed code on the backend that allowed me to retrieve >2,000 rows from a saved question; I neglected to mention in my previous comment that my observations concerned saved questions. Anyway, I'm not steeped in this codebase, but this all suggests to me that at least in some cases the backend is controlling how many rows are ever sent to the frontend.

tlrobinson commented 7 years ago

@axb21 Thanks for looking into this. The constant you modified is the limit for normal queries (in the query builder and dashboards), but we want to ignore that for CSV downloads and use the 1 million row limit instead. Also I think your change will only increase the limit for aggregated queries, not "bare rows", which has a separate limit of 2000. We don't actually want to modify either of those limits.

axb21 commented 7 years ago

@tlrobinson gotcha, that makes sense. One thing I can say that might be of interest: I searched through the source for the number 10,000--the largest number of rows I was ever able to download--and only found that number in a handful of places:

$ find src/ -type f -exec egrep -nil '\b10000\b' {} \;
src/metabase/util/stats.clj
src/metabase/api/dataset.clj
src/metabase/driver/googleanalytics/query_processor.clj
src/metabase/driver.clj

The patch I posted changes that number in src/metabase/api/dataset.clj, which I understand isn't going to work. The other places where 10000 appears did not look (to my naive eyes) like the right places to make a change like this.

salsakran commented 7 years ago

@tlrobinson I don't think the front end should control how many rows get sent back. This is a limit in place for the backend to remain stable, and it should control how many rows get sent back.

I'll keep poking, I stopped there as I thought that was the root bug. =(

tlrobinson commented 7 years ago

@salsakran My PR already prevents the frontend from sending the constraints object. Are you saying the backend should not respect constraints passed in the request as well?

The following endpoints override constraints with the hard-coded query-constraints (10k grouped rows / 2k bare rows)

(those last 2 are why the saved question download endpoints are still limited to 10k/2k rows)

And the following do not (allowing constraints to pass through from the request)

Perhaps we should just change all the CSV/JSON download endpoints to discard constraints and not replace with query-constraints, so the query processor will use absolute-max-results (1048576, apparently the same as Excel https://github.com/metabase/metabase/blob/master/src/metabase/query_processor.clj#L28-L33 )

tlrobinson commented 7 years ago

@axb21 The 10000 constant you found (max-results) is the correct one, but I'm saying that limit shouldn't be applied for the download endpoints at all.

mmerchant commented 7 years ago

@tlrobinson Just to clarify -- on the CSV export it is intended to limit the export to 10,000 rows? Or is that constraint removed? I tried on the v0.22.0-rc1 tag and do still run into the 10K row limit on CSV exports.

tlrobinson commented 7 years ago

@mmerchant Yes but I think this was merged after RC1. We'll put out another RC later today.

luqmansyauqi commented 7 years ago

@tlrobinson is this already implemented in v0.22.1?

SchwiftyOne commented 6 years ago

@tlrobinson @axb21 .I know this is an old issue, but does changing (max-results) allow the front-end and/or the CSV export to send more than 10,000 rows? I'm very interested in using Metabase in our corporate environment but the 10,000 row limit is major problem for how we'd want to use it and if changing the constant in the source is fixes this that would solve a big problem for us.

salsakran commented 6 years ago

How do you want to use it?

We're kind of at a mostly stable settling point where most people find 10k enough, while some people on small instances still manage to crash their server once every so often. In general, we think of Metabase as a tool for interactive BI vs pushing big files around. There are a lot of little assumptions on that regard.

If you can give us a clear sense of what you're trying to do with it, we can perhaps find a way to get it done. In general though, we're not trying to be a "pull massive files out of databases" play, and at a certain size (depending on your data warehouse, server specs and browser), you'll want to have a dedicated data export tool in conjunction with us.

SchwiftyOne commented 6 years ago

We're not looking to download huge datasets (1million rows+), but the queries we'd like to have exported are around 200k to 300k rows or so. I understand that Metabase is primarily a BI visualization tool, so I don't expect it to do huge data set exports but something in that range would be great for us.

mazameli commented 6 years ago

Wait, so the limit for the number of downloadable rows is 10k and not 1 million rows? If so, I'm going to change this message in the UI:

screen shot 2017-08-01 at 12 45 38 pm
axb21 commented 6 years ago

@SchwiftyOne yes. I've been happily using a modified version of metabase for over a year now and a higher limit than 10k rows. I couldn't tell you the largest result set someone's downloaded, but it's well over 10k rows and metabase has worked fine.

@salsakran Yes, it makes sense that metabase is not tuned to query and return millions of rows. But an arbitrary, software-selected-at-deploy-time, hard limit on the number of returned rows seems unnecessarily constraining. My preference would be that I, as the person installing and administering a metabase instance, have the power to override this limit if I so choose, and it's on me if I create havoc. Really the request here is to turn a hardcoded limit into a configurable option and it's hard to understand why that'd be controversial unless there's no one available to implement it.

SchwiftyOne commented 6 years ago

@axb21 Thanks for the feedback! That's great that you can modify it. My only problem now is how to compile it. Unfortunately I don't have a mac, and reading up on how to compile the metabase source to jar, it seems you need a mac to do so. You wouldn't happen to have a compiled metabase jar with your modifications available for download somewhere?

salsakran commented 6 years ago

Sorry, I misspoke.

The limits are 2k for raw table in-browser display 10k for aggregations in-browser display 1M rows for downloads.

@SchwiftyOne it seems like the number you mention (200-300k) should happen without too much drama. You will want to make sure your instance has enough memory, and you might need to tweak some config settings if you start seeing download failures.

axb21 commented 6 years ago

@SchwiftyOne according to @salsakran's last response you shouldn't need my version of the jar. Honestly I'm several minor versions behind and I hacked in what I needed, so I'd suggest sticking with an official metabase release if it can handle what you need. I'd be happy to share offline what I hacked if it turns out the official release doesn't work for you. Good luck!

SchwiftyOne commented 6 years ago

I just tested it out and it worked!. Didn't realize that the display limit was 10k but the CSV limit was 1M. Thanks @salsakran and @axb21

MiBaDK commented 6 years ago

Only getting 100.000 rows. Version 0.25.2. Has this been changed from the 1M?

paoliniluis commented 5 months ago

https://github.com/metabase/metabase/issues/28144