rabanti-github / PicoXLSX

PicoXLSX is a small .NET / C# library to create XLSX files (Microsoft Excel 2007 or newer) in an easy and native way
MIT License
52 stars 13 forks source link

What about basic read access? #3

Closed big-shadow closed 6 years ago

big-shadow commented 6 years ago

Hey,

I'm entertaining using this. Based on the documentation it seems like this tool is intended to write to disk only. Can I use this to simply iterate and read spreadsheet rows?

I'm avoiding the interlop assemblies for such a simple task.

rabanti-github commented 6 years ago

Hi,

Yes, the library is a write-only tool. A basic reader is planned in another project (NanoXLSX) but not fully implemented yet. Maybe I can help you in this case. Reading Excel files depends on the type of data that is stored. I can give you some hints:

If your data is completely non-text (numeric, date, boolean, formulas), then it is quite easy to read values from a worksheet entity. However, If you want to read an Excel file, with text (strings), then it can become a little bit more complicated. Excel ist storing each text value of a cell into a file called "sharedStrings.xml". This is beneficial for the file size if strings are reoccuring. In the worksheets are only the IDs of the shared strings referenced (one string can be references several times by a numeric ID). To read such a worksheet, the SharedString file must be read first and resolved.

If you want to read dates, then you have to deal with the rather odd OADate format. Each day is one full number (start 1.1.1900) and 24 hours are a fraction of 1.

Numeric values and booleans are very easy to handle. They are not altered.

Formulas are also easy since they are only expressed as text. However, the internal formulas are different from the formulas you see in Excel.

If you want to read styles or diagrams, then it can become really complicated. Especially styles can be quite a mess.

So, what are your plans?

big-shadow commented 6 years ago

Awesome. Thank you. I'm reading fairly simple accounting data. It's almost entirely integer and decimal data typed columns. I'll be outputting a format similar to JSON.

rabanti-github commented 6 years ago

Hi, Integer and decimal data are (internally) the same in Excel. you can read it as double values. Maybe have a look at another of my projects: https://github.com/rabanti-github/Media-Extractor. MediaExtractor has already a method implemented to unpack Office files (docx, pptx, xlsx). Furthermore, it has a preview function to look at the embedded XML files in those file formats. But you have to check/enable the menu item "Options > Show Other Embedded Files" to see the XML files.

Then you can have a look at the worksheet file(s) without the necessity of unpacking. You will certainly find your values there and can figure out which XML tags and attributes are relevant for your use case.

After unpackung (e.g. by the unpacking function of MediaExtractor) you can read and simply parse the XML file(s) with an XML reader.