qrilka / xlsx

Simple and incomplete Excel file parser/writer
MIT License
128 stars 62 forks source link

Add writing support for large excell files (streaming?) #132

Closed jappeace closed 8 months ago

jappeace commented 3 years ago

We currently have to deal with massive excel files (100 000 to millions of rows). Reading all this data into memory isn't feasible. 100 000 rows for example takes up 2 gigs of ram as Xslx representation, 1 is over 8 gig (our app crashed trying to do this).

If we could represent the file as a stream it would solve this problem. I'd suggest conduit because it combines with zip as well.

qrilka commented 3 years ago

There was a ticket #7 about this As there was no particular motivation it was closed. If you're willing to contribute something like this I'd be glad to review the code.

jappeace commented 3 years ago

Working around it now with csv but there is definitely some interest here from the business side to producing xslx files, so I'll be doing this in the future.

Your review and feedback would be appreciated :wink:

idontgetoutmuch commented 3 years ago

I have just run into this problem and will probably now have to convert to the file to CSV and use streaming via cassava.

idontgetoutmuch commented 3 years ago

@jappeace if you made any progress on this which you can share, I could maybe work on it a bit more.

idontgetoutmuch commented 3 years ago

@jappeace I notice you have done a lot of work here: https://github.com/jappeace/xlsx.

What I'd like to be able to do is read a row at a time e.g. to create a sum of a column. Is it possible to do this with your clone?

jappeace commented 3 years ago

yes! Supercede is implementing this at the moment, the api is unstable but for reading it sort of works already here: https://github.com/SupercedeTech/xlsx [1]

We may change how the reading api works, (there are two engineers working on this now including me), but we settled on the shape of SheetItem at least so you can program against that (stuff maybe added). I'm working on writing xlsx as well. The prototype for writing is done, but it may still have some bugs left.

We're intending to upstream the work once we got it to work reliably.

[1] note that's a different repo than mine, I originally implemented it as part of "research day" but now it turns out client actually need it so it got moved to supercede github.

jappeace commented 3 years ago

Let me know if you need help, have questions or any issues. It's really important for the company I work for to get this right :slightly_smiling_face:

idontgetoutmuch commented 3 years ago

If you could show me how I would get e.g. rows 5, 6 and 7 from an .xlsx file and also how I can process a table row by row that would be really helpful.

At the moment I am using R to convert a tab in an .xlsx file to a .csv file and then using cassava with something like the following to get rows with indices in is:

  let bar :: [[String]] -> Records [String] -> IO [[String]]
      bar acc baz =
        case baz of
          Cons (Left s)          _ -> error s
          Cons (Right [])        r -> bar acc r
          Cons (Right a@(x : _)) r -> if x `elem` (map show is)
                                        then bar (a : acc) r
                                        else bar acc r
          Nil _ _                  -> return acc

https://hackage.haskell.org/package/cassava-0.5.2.0/docs/Data-Csv-Streaming.html lets you process a row at a time which is what I want to do after I have processed the given rows.

Many thanks

jappeace commented 3 years ago

First obtain a producer conduit from your xlsx file, probably you want to use sourceFile. Then put that into readxlsx. Now you have a conduit of sheetItems, which you can filter, and fold.

Mind you to use those combinator you have to combine with .| which makes the output of the left conduit the input to the right one.

It all seems a bit convoluted but it allows you to do lazy io (constant memory).

jappeace commented 8 months ago

this is done.