mjul / docjure

Read and write Office documents from Clojure
MIT License
622 stars 129 forks source link

Blank rows/cells are omitted from sequences and from select-columns #19

Closed kornysietsma closed 7 years ago

kornysietsma commented 10 years ago

This is basically the same as the issue my colleague nurous raised against clj-excel: https://github.com/mebaran/clj-excel/pull/5

The root problem is that iterators in POI skips blank rows and cells. As a result, a blank row won't be returned in a row-seq call, and a blank cell won't be returned in a cell-seq call. Worse, when running functions like select-columns the results can be quite wrong due to these problems.

(note that this is for truly blank cells - a spreadsheet seems to be implemented somewhere as a sparse array. Cells where you've entered data previously, and then deleted the data, may still be returned by these functions)

mjul commented 10 years ago

Thanks for reporting this issue. If you have the time, please submit a fix.

kornysietsma commented 10 years ago

Actually, on writing some tests for an upcoming patch, I found that select-columns actually mostly works - it does skip blank rows/cells, but as it uses .getColumnIndex to work out column mappings, it assigns the correct column labels anyway.

What is the preferred behaviour for select-columns on a blank row/column?

mjul commented 10 years ago

With the Clojure semantics of maps, empty map for a blank row would have the same result as the latter solution in normal use cases, so I would support your recommendation, blank row = empty map.

mjul commented 10 years ago

PS: please note that the repo has moved, the old master still works but for good measure consider changing your master to github.com/mjul/docjure (was: github.com/ative/docjure)

dmichulke commented 8 years ago

For all those with this problem, as a quick fix you can just use with-redefs without changing dependencies or cloning the repo:

(with-redefs [dk.ative.docjure.spreadsheet/row-seq (fn [^Sheet sheet]
                      (map #(.getRow sheet %) (range 0 (inc (.getLastRowNum sheet)))))]
    ;;load your excel file here 
)
vkz commented 8 years ago

Just to check if the project is still alive. Ability to read sparse tables correctly is kind of a big deal and at least my experiments with this PR show it's legit and does the right thing. Be great to have it merged.

Big kudos to @kornysietsma for the fix! Big thanks to creators and maintainers!

Note: with this fix rows of different lengths may require padding at the end. Not a big deal and probably not something you want by default but might trip someone. Only matters in cases where you deal with valid tables with some rows without data in their tail-cells.

mjul commented 8 years ago

@vkz I agree. If someone would update the pull req (#22) from @kornysietsma to match the latest version and add some example spreadsheets with missing rows and cells to the test suite, I would be happy to merge it.

kornysietsma commented 8 years ago

I'll try to update it - hopefully it shouldn't take long, I just need prodding :-)

On 11 Aug 2016 8:52 a.m., "Martin Jul" notifications@github.com wrote:

@vkz https://github.com/vkz I agree. If someone would update the pull req (#22 https://github.com/mjul/docjure/pull/22) from @kornysietsma https://github.com/kornysietsma to match the latest version and add some example spreadsheets with missing rows and cells to the test suite, I would be happy to merge it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mjul/docjure/issues/19#issuecomment-239094884, or mute the thread https://github.com/notifications/unsubscribe-auth/AABFZTogQdEj-I-pHeWd68GJR-_R5HdKks5qetS3gaJpZM4Ba29W .

kornysietsma commented 8 years ago

I've been looking again at this patch, and I realised why I didn't get back to it - it needs some decisions on how best to implement it, and my preferred option may be a breaking change for some clients.

The problem is, ideally I'd want to just replace the existing row-seq and cell-seq with the dense functions, so they always return nil for missing rows / cells. However, this could easily be a breaking change if there are clients that don't expect nil return values (for example, the docjure 'set-row-style!' function would need to be modified to ignore nil cells)

Would this be OK?

Alternatively, you could make this optional - pass an extra missing-cell-policy flag, or set a local binding like missing-cell-policy that specifies which approach to use, and add a macro like (with-missing-cell-policy :return-nil ...) - but this might be needless complication.

So, would you prefer:

  1. Two parallel sets of functions (as per my original patch) cell-seq and dense-cell-seq, row-seq and dense-row-seq, users can choose which they prefer by calling different functions
  2. Modify cell-seq and row-seq to return nils for blank cells/rows, possible breaking change
  3. Let the user choose sparse or dense mode 3a. via an extra option parameter to lots of functions 3b. via a local binding
    • Korny

On 11 August 2016 at 12:39, Kornelis Sietsma kornys@gmail.com wrote:

I'll try to update it - hopefully it shouldn't take long, I just need prodding :-)

On 11 Aug 2016 8:52 a.m., "Martin Jul" notifications@github.com wrote:

@vkz https://github.com/vkz I agree. If someone would update the pull req (#22 https://github.com/mjul/docjure/pull/22) from @kornysietsma https://github.com/kornysietsma to match the latest version and add some example spreadsheets with missing rows and cells to the test suite, I would be happy to merge it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mjul/docjure/issues/19#issuecomment-239094884, or mute the thread https://github.com/notifications/unsubscribe-auth/AABFZTogQdEj-I-pHeWd68GJR-_R5HdKks5qetS3gaJpZM4Ba29W .

Kornelis Sietsma korny at my surname dot com http://korny.info .fnord { display: none !important; }

kornysietsma commented 8 years ago

Hey - trying to close this off... :)

In the absence of responses to my previous over-long essay, I'm going to go with the slightly messy option of letting people choose which way docjure will work via a dynamic var and a wrapper macro.

That seems to be the only way to preserve current behaviour, without duplicating a whole bunch of functions, or adding extra parameters all over the place

So if you wrap code as follows: (with-missing-values-as-nil (for [row (row-seq sheet) cell (cell-seq row)] ... )

then missing rows and cells will be returned as nil

I'm cleaning up the tests and docs, will probably get a pull request in soon.

mjul commented 8 years ago

Hi @kornysietsma I think your initial initial suggestion of always returning nils for blank rows makes more sense. That it is not backwards compatible is not a problem: it is a bug-fix, after all. Thanks for helping out. Cheers, Martin

dmichulke commented 8 years ago

Not my decision but I'd also prefer option 2 - returning nil for blank rows/cells. The reason is that removing nils from the dense version is easy with standard clojure functionality. Adding nils to the non-dense version at the right places is impossible.

So the former is more general one but still requires only few workarounds to make it work as before.

Probably the changelog should contain the word BREAKING CHANGE and you might want to increase the major version.

kornysietsma commented 8 years ago

Cool - I'll go down that route then, it'll keep the code and the tests much simpler.

On 6 September 2016 at 08:54, Daniel Michulke notifications@github.com wrote:

Not my decision but I'd also prefer option 2 - returning nil for blank rows/cells. The reason is that removing nils from the dense version is easy with standard clojure functionality Adding nils to the non-dense version at the right places is impossible.

So the former is more general one but still requires only few workarounds to make it works a before.

Probably the changelog should contain the word BREAKING CHANGE and you might want to increase the major version.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mjul/docjure/issues/19#issuecomment-244876316, or mute the thread https://github.com/notifications/unsubscribe-auth/AABFZT97_KoVhiMUisrtzX3XhrwBq9yhks5qnRwhgaJpZM4Ba29W .

Kornelis Sietsma korny at my surname dot com http://korny.info .fnord { display: none !important; }

kornysietsma commented 8 years ago

I've updated the pull request #22

kinleyd commented 7 years ago

@mjul: Thanks for a great tool. All the others here too, esp. also @kornysietsma.

On topic: I think returning nil always for blank rows and cells is the right decision. It will make row-seq and cell-seq much more dependable.

mjul commented 7 years ago

The fix for this issue from @kornysietsma has been included in version 1.11.0. Thanks for helping make the library better.

mjul commented 7 years ago

@kinleyd Thanks for the kind words. I was very happy to see that you are using it in Bhutan as it is one of my favourite countries. If you have any ideas or suggestions, don't hesitate to contribute to the project.

mjul commented 7 years ago

@dmichulke Thanks for the thoughtful feedback. I decided not to upgrade the major version as I consider this a bug-fix (blank lines being skipped was a thing leaking through from the underlying POI library). Please keep your suggestions and pull requests coming.

mjul commented 7 years ago

@vkz This has now been fixed, so please give it a whirl and submit feedback, suggestions and pull requests to make it even better.

mjul commented 7 years ago

@kornysietsma Thanks a lot for helping out with this. It is really good work. Please send more pull requests in the future.

kinleyd commented 7 years ago

@mjul: Nice to know you like Bhutan! docjure is a great library and I look forward to making some contributions where I can. Keep up the good work.

On Mon, Sep 19, 2016 at 7:22 PM Martin Jul notifications@github.com wrote:

@kornysietsma https://github.com/kornysietsma Thanks a lot for helping out with this. It is really good work. Please send more pull requests in the future.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mjul/docjure/issues/19#issuecomment-247991174, or mute the thread https://github.com/notifications/unsubscribe-auth/ABWrvyNOCM-1J13C485_dTMtSG_OQO2Yks5qrox2gaJpZM4Ba29W .