LeandroC89 commented 4 years ago

58

I've added excel read/write functions and their respective tests. (Plus two files, one for data and another generated by the test) I've also had to add a dependency to apache poi

For now I kept excel values as strings, as Excel is a bit tricky with the way it stores its data (and their typing), this way we can read a data with the displayed value instead of excel's internal format (same for numbers where the .0 would be added).

I've been using these a lot on a project but since I've made some changes/refactoring in order to contribute they don't have all the live testing their previous versions had (although I still tested on my projects and a few variations to be sure)

Please let me know your feedback on this

Thanks and best regards

holgerbrandl commented 4 years ago

Thanks for this great contribution. Excel support is definitely a big missing piece in the puzzle so far.

I think after merging it, we/me should try to modularize the artifacts to prevent an ever growing set of dependencies and to still keep the slim possibility that krangl could support multiplatform someday. So artifacts should be organized somehow into

de.mpicbg.scicomp:krangl-corel
de.mpicbg.scicomp:krangl-excel
de.mpicbg.scicomp:krangl-smile (withe Machine learning bindings to https://haifengl.github.io/)
.. some more as listed in the submodules under examples in the repo
de.mpicbg.scicomp:krangl-all (a main aggregator pom for those who don't care about dependency trees

Does this make sense to you? I would clearly take care of this modularization.

Specifically concerning the PR, I think we should apply some fine-tuning to improve compliance with the state of the art (which is https://readxl.tidyverse.org imho. In particular, we/me/you should

simplify naming readSheetFromExcelFile --> readExcel (same for write)
replace in rowNumber with cellRange because users tend to put multiple tables into single sheets. See https://readxl.tidyverse.org/reference/read_excel.html for the spec of this feature
Add dokka example (there are some in the codebase already, so this should be quick)
Add javadoc to public API
Consider adding colTypes as used in DataFrame.readCSV and also needed for excel because poor type handling (see https://readxl.tidyverse.org/reference/read_excel.html)
Add a secondary public read method (while reusing as much as code as possible) that accepts also Int indices as "sheet" argument
Add column type-guessing as in krangl.TableIOKt#peekCol and https://readxl.tidyverse.org/reference/read_excel.html

I know this is a lot, and I'm happy to help with the implementation, but it may take some days/weeks before I manage to realize all these requirements. Could you support by taking over some of these extensions?

LeandroC89 commented 4 years ago

Thank you, sure I can make the changes. As I don't have much time I'll be taking on only a couple of them at a time, and I'll add comments every time I start working on another so that we don't overlap in case you're working on them too.

For now I'll start with:

simplify naming readSheetFromExcelFile --> readExcel (same for write)
Add a secondary public read method (while reusing as much as code as possible) that accepts also Int indices as "sheet" argument
Consider adding colTypes as used in DataFrame.readCSV and also needed for excel because poor type handling (see https://readxl.tidyverse.org/reference/read_excel.html)

Depending on how my weekend goes I'll try to tackle as much of the others as I can.

Best Regards

LeandroC89 commented 4 years ago

Changes made:

Renaming read/write functions
Overloaded readExcel to allow sheet retrieval by Index (minimal code duplication)
Added colTypes parameter
Added column type guessing via peekCol (will test this one further and add guessMax limit)

Will start working on:

replace in rowNumber with cellRange because users tend to put multiple tables into single sheets. See https://readxl.tidyverse.org/reference/read_excel.html for the spec of this feature
Add dokka example (there are some in the codebase already, so this should be quick)
Add javadoc to public API

I'll also add these smaller changes

Renaming some of the parameters to match read_excel and write_excel specifications
Ensure names are unique as in readCSV

I believe I didn't answer the modularization part, but yes, it makes perfect sense to me.

Edit: Corrected "minimal code reuse" since I meant the exact opposite.

LeandroC89 commented 4 years ago

Most of the things are already updated, there is one that I may need some light on what to do:

Add javadoc to public API

I've added the Dokka Samples and even ran the Javadoc generation to make sure the new functions are there (unversioned so they wont be pulled). Other than this I don't know what I should do, so any info would much appreciated.

I only now noticed that the cellRange is a bit more complex than I initially thought (specifying only the rows/cols, anchor and so on). I thought about doing it with a wildcard, for example "A:D would consider columns A,B,C,D up until the first blank row, or even "A1:*" stopping at the first blank column and row but I'll be checking how similar I can make it to read_excel without sacrificing much. So I may need to work on it during the weekend, that should also give me some time to refactor and retest all the functionalities properly.

Thanks

holgerbrandl commented 4 years ago

Thank you @LeandroC89 for this wonderful contribution.

Concerning cellranges, let's keep it simple. IMHO a simple D3:F4 without any anchoring or exclusions or wildcards is fine. What we could do potentially is to not use String as parameter type in readExcel but CellRangeAdress (which could be named just CellRange) directly. This would allow adding more complex cell range definitions later by simply enhancing CellRange with secondary constructors or more complex builder patterns.

One thing which I'll do (but you could prepare for it already) is to use dedicated files such as ExcelIO.kt, Exceltests.kt and so on for all code that imports poi. As pointed out in my initial comment, we should start providing submodules to minimize dependencies. I'll do that once we have merged your PR.

Concerning the javadoc, please see my code review comment.

Feel welcome if you need any help, or think that I should take over the remaining bits and pieces.

LeandroC89 commented 4 years ago

Hello @holgerbrandl and thank you for all the tips/reviews,

cellRange is now of type CellRangeAddress and if not provided will instead default to sheet values first/last row/column (which only take into account the populated rows)

I also added a few more functionalities in order to allow more flexibility without the users having to prepare the files beforehand, and moved the apache poi dependent functions to the new .kt files.

As for the javadoc I was unable to find the review comment (or the review) on the Files changed tab, I'm a bit new to contributions so I'm probably not looking in the right places and may need some help on that.

If there is any feature you think would be useful, or that something is missing/needing an update I'll gladly help as much as I can but feel free to take over if/when you think it best.

Thank you

Edit: Typo and phrasing

holgerbrandl commented 4 years ago

Thanks again for this great contribution.

I've added just a default parameter. sheetIndex=0 in readExcel to enable usage with just a file argument.

holgerbrandl / krangl

Added Excel read write functions #97

58