mjul / docjure

Read and write Office documents from Clojure
MIT License
622 stars 129 forks source link

Add support for streaming API. #47

Open bagl opened 8 years ago

bagl commented 8 years ago

Issue #46.

One function for dumping data into XLSX file using streaming and one macro for doing the same, but explicitly by row using append-row! function (define in the scope of the macro body).

I do not know if append-row! scoped to macro body is a good idea...

bagl commented 8 years ago

Alternative withou using macro (just pass a function in that takes function to append row as its argument)

(defn with-streaming-workbook-fn!
  [file-name sheet-name body-fn]
  (let [wb (SXSSFWorkbook. 1)
        ^Sheet sh (add-sheet! wb sheet-name)
        row-num (atom 0)
        append-row!
        (fn [row-data]
          (let [r (.createRow sh @row-num)]
            (swap! row-num inc)
            (doseq [[j value] (map-indexed #(list %1 %2) row-data)]
              (set-cell! (.createCell r j) value))))]
    (try
      (body-fn append-row!)
      (save-workbook-into-file! file-name wb)
      (finally
        (.dispose wb)))))

There are no name collisions then...

mjul commented 8 years ago

Thanks for contributing. It solves the immediate problem for writing data.

However, it does not play well with the sheet styling code. I think the styling code is at fault as it is imperative in nature and not so elegant.

I wonder if you have thought about this and we could find a design that would enable the library to work (to the extent possible) with both in-memory and streaming spreadsheets.

There is an old idea for a V2 design with a data structure that can be piped (thrushed) between the Docjure functions instead of passing the underlying POI objects. This would enable us to use it as a builder (lazily, even) which would be good for streaming and also allow to make the styling and formatting code more elegant as it could have a notion of context ("make the last row bold", "add another sheet at the end").

It would probably involve multi-methods so that the functions can take advantage of the context. For example we can track the selected sheet so that rows added will be added to the last one selected.

With multiple sheets in play it might even be able to jump between them in data processing code as we would build up the structures lazily and still be able to write them sequentially if needed the streaming API.

Please share your thoughts.

kennethkalmer commented 7 years ago

Thanks for this PR, I just tested it by placing the new functions in an isolated namespace and got my fairly small workbooks generated using vanilla lein (-Xmx1g). Workbooks in question are 18MB (~53000 rows, many columns) and 5.9MB (47000 rows, still many columns).

ghost commented 7 years ago

@mjul Is there something preventing this merge? Metabase is depending on docjure to export their tables to excel. And this functionality is currently experiencing serieus memory issues (https://github.com/metabase/metabase/issues/5187).

mjul commented 7 years ago

I have been on vacation and busy with other things. I will have a look ASAP. Cheers, Martin

mjul commented 6 years ago

OK, I looked at it again. My main concern is that it is not really integrated with the other library functions (it allows you to write a single-sheet workbook). However, If you add some test cases to show that it supports the regular cell data types I will add it as is to the version 1 branch, then we can work out a better API for V2 that allows it to play better with the rest of the library.

thucnc commented 6 years ago

@bagl and @milanvdmria, is there any way to move forward, as this issue still blocks Metabase download feature ? Please help, thank you.

ghost commented 6 years ago

@thucnc To be honest, we do not have any in-company knowledge on Clojure and do not have the bandwidth to work on this currently.

paoliniluis commented 1 month ago

Hi @mjul, Luis here from the Metabase team. We use your library to generate XLSX docs and this specific feature can make a big difference on the performance of our application. I believe this is incomplete and that's why it hasn't been merged. Is that correct?

mjul commented 1 month ago

@paoliniluis yes, as mentioned above it lacks test cases