Closed LeandroC89 closed 4 years ago
Thanks for this great contribution. Excel support is definitely a big missing piece in the puzzle so far.
I think after merging it, we/me should try to modularize the artifacts to prevent an ever growing set of dependencies and to still keep the slim possibility that krangl could support multiplatform someday. So artifacts should be organized somehow into
de.mpicbg.scicomp:krangl-corel
de.mpicbg.scicomp:krangl-excel
de.mpicbg.scicomp:krangl-smile
(withe Machine learning bindings to https://haifengl.github.io/)
.. some more as listed in the submodules under examples
in the repo
de.mpicbg.scicomp:krangl-all
(a main aggregator pom for those who don't care about dependency trees
Does this make sense to you? I would clearly take care of this modularization.
Specifically concerning the PR, I think we should apply some fine-tuning to improve compliance with the state of the art (which is https://readxl.tidyverse.org imho. In particular, we/me/you should
readSheetFromExcelFile
--> readExcel
(same for write)rowNumber
with cellRange
because users tend to put multiple tables into single sheets. See https://readxl.tidyverse.org/reference/read_excel.html for the spec of this featurecolTypes
as used in DataFrame.readCSV
and also needed for excel because poor type handling (see https://readxl.tidyverse.org/reference/read_excel.html)Int
indices as "sheet" argumentI know this is a lot, and I'm happy to help with the implementation, but it may take some days/weeks before I manage to realize all these requirements. Could you support by taking over some of these extensions?
Thank you, sure I can make the changes. As I don't have much time I'll be taking on only a couple of them at a time, and I'll add comments every time I start working on another so that we don't overlap in case you're working on them too.
For now I'll start with:
Depending on how my weekend goes I'll try to tackle as much of the others as I can.
Best Regards
Changes made:
Will start working on:
I'll also add these smaller changes
I believe I didn't answer the modularization part, but yes, it makes perfect sense to me.
Edit: Corrected "minimal code reuse" since I meant the exact opposite.
Most of the things are already updated, there is one that I may need some light on what to do:
I've added the Dokka Samples and even ran the Javadoc generation to make sure the new functions are there (unversioned so they wont be pulled). Other than this I don't know what I should do, so any info would much appreciated.
I only now noticed that the cellRange is a bit more complex than I initially thought (specifying only the rows/cols, anchor and so on). I thought about doing it with a wildcard, for example "A:D would consider columns A,B,C,D up until the first blank row, or even "A1:*" stopping at the first blank column and row but I'll be checking how similar I can make it to read_excel without sacrificing much. So I may need to work on it during the weekend, that should also give me some time to refactor and retest all the functionalities properly.
Thanks
Thank you @LeandroC89 for this wonderful contribution.
Concerning cellranges, let's keep it simple. IMHO a simple D3:F4 without any anchoring or exclusions or wildcards is fine. What we could do potentially is to not use String
as parameter type in readExcel
but CellRangeAdress
(which could be named just CellRange
) directly. This would allow adding more complex cell range definitions later by simply enhancing CellRange
with secondary constructors or more complex builder patterns.
One thing which I'll do (but you could prepare for it already) is to use dedicated files such as ExcelIO.kt, Exceltests.kt and so on for all code that imports poi
. As pointed out in my initial comment, we should start providing submodules to minimize dependencies. I'll do that once we have merged your PR.
Concerning the javadoc, please see my code review comment.
Feel welcome if you need any help, or think that I should take over the remaining bits and pieces.
Hello @holgerbrandl and thank you for all the tips/reviews,
cellRange is now of type CellRangeAddress and if not provided will instead default to sheet values first/last row/column (which only take into account the populated rows)
I also added a few more functionalities in order to allow more flexibility without the users having to prepare the files beforehand, and moved the apache poi dependent functions to the new .kt files.
As for the javadoc I was unable to find the review comment (or the review) on the Files changed tab, I'm a bit new to contributions so I'm probably not looking in the right places and may need some help on that.
If there is any feature you think would be useful, or that something is missing/needing an update I'll gladly help as much as I can but feel free to take over if/when you think it best.
Thank you
Edit: Typo and phrasing
Thanks again for this great contribution.
I've added just a default parameter. sheetIndex=0 in readExcel
to enable usage with just a file argument.
58
I've added excel read/write functions and their respective tests. (Plus two files, one for data and another generated by the test) I've also had to add a dependency to apache poi
For now I kept excel values as strings, as Excel is a bit tricky with the way it stores its data (and their typing), this way we can read a data with the displayed value instead of excel's internal format (same for numbers where the .0 would be added).
I've been using these a lot on a project but since I've made some changes/refactoring in order to contribute they don't have all the live testing their previous versions had (although I still tested on my projects and a few variations to be sure)
Please let me know your feedback on this
Thanks and best regards