=pod
=head1 NAME
Spreadsheet::XLSX::Reader::LibXML - Read xlsx spreadsheet files DEPRECATED
=begin html
=end html
=head1 SYNOPSIS
The following uses the 'TestBook.xlsx' file found in the t/test_files/ folder of the package
#!/usr/bin/env perl
use strict;
use warnings;
use Spreadsheet::XLSX::Reader::LibXML;
my $parser = Spreadsheet::XLSX::Reader::LibXML->new();
my $workbook = $parser->parse( 'TestBook.xlsx' );
if ( !defined $workbook ) {
die $parser->error(), "\n";
}
for my $worksheet ( $workbook->worksheets() ) {
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();
for my $row ( $row_min .. $row_max ) {
for my $col ( $col_min .. $col_max ) {
my $cell = $worksheet->get_cell( $row, $col );
next unless $cell;
print "Row, Col = ($row, $col)\n";
print "Value = ", $cell->value(), "\n";
print "Unformatted = ", $cell->unformatted(), "\n";
print "\n";
}
}
last;# In order not to read all sheets
}
###########################
# SYNOPSIS Screen Output
# 01: Row, Col = (0, 0)
# 02: Value = Category
# 03: Unformatted = Category
# 04:
# 05: Row, Col = (0, 1)
# 06: Value = Total
# 07: Unformatted = Total
# 08:
# 09: Row, Col = (0, 2)
# 10: Value = Date
# 11: Unformatted = Date
# 12:
# 13: Row, Col = (1, 0)
# 14: Value = Red
# 16: Unformatted = Red
# 17:
# 18: Row, Col = (1, 1)
# 19: Value = 5
# 20: Unformatted = 5
# 21:
# 22: Row, Col = (1, 2)
# 23: Value = 2017-2-14 #(shows as 2/14/2017 in the sheet)
# 24: Unformatted = 41318
# 25:
# More intermediate rows ...
# 82:
# 83: Row, Col = (6, 2)
# 84: Value = 2016-2-6 #(shows as 2/6/2016 in the sheet)
# 85: Unformatted = 40944
###########################
=head1 DEPRECATION
I<This module is being replaced by L
=head1 DESCRIPTION
This is an object oriented just in time Excel spreadsheet reader package that should
parse all excel files with the extentions .xlsx, .xlsm, .xml I<L<Excel 2003 xml
|https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats> (SpreadsheetML)> that
can be opened in Excel 2007+ applications. The quick-start example provided in the
SYNOPSIS attempts to follow the example from L
The intent is to fully document all public functions but you may need to go to sub modules to find more detailed documentation. This package operates on the Excel file with three primary tiers of classes. Each level provides object methods to access the next level down.
=over
Workbook level (This doc)
=over
L<Worksheet level|Spreadsheet::XLSX::Reader::LibXML::Worksheet>
=over
L<Cell level|Spreadsheet::XLSX::Reader::LibXML::Cell>
=back
=back
=back
There are some differences from the L
This should alow you to change only the part you want to perform differently if you
have the desire to tinker with the guts. Read the full documentation for all
opportunities!
In the realm of extensibility this package uses L
Future iterations could include a DOM parser option but that is a very low priority.
Currently this package does not provide the same access to the visual format elements
provided in L
To skip the why and nitty gritty of design and jump to implementation details go to the
L<Attributes|/Attributes> section.
=head2 Architecture Choices
This is yet another package for parsing Excel xml or 2007+ workbooks. The goals of this
package are five fold. First, as close as possible produce the same output as is visible
in an excel spreadsheet with exposure to underlying settings from Excel. Second, adhere
as close as is reasonable to the L
Fifth (and finally), the design of this package is targeted at handling as large of an Excel
file as possible. In general this means that design decisions will generally sacrifice speed
to keep RAM consumption low. Specifically this spreadsheet parser does not read the file into
memory completely when it is opened. Since the data in the sheet is parsed just in time the
information that is not contained in the primary meta-data headers will not be available for
review L<until the sheet parses to that point|Spreadsheet::XLSX::Reader::LibXML::Worksheet/max_row>.
In cases where the parser has made choices that prioritize speed over RAM savings there
will generally be an L<attribute available to turn that decision off|/set_cache_behavior>.
All in all this package solves many of the issues I found parsing Excel in the wild. I
hope it solves some of yours as well.
=head2 Warnings
B<1.>This package uses L
I have tested this with Archive::Zip 1.30. Please let me know if this does not work with a
sucessfully installed (read passed the full test suit) version of Archive::Zip newer than that.
B<2.> Not all workbook sheets (tabs) are created equal! Some Excel sheet tabs are only a
chart. These tabs are 'chartsheets'. The methods with 'worksheet' in the name only act on
the sub set of tabs that are worksheets. Future methods with 'chartsheet' in the name will
focus on the subset of sheets that are chartsheets. Methods with just 'sheet' in the name
have the potential to act on both. The documentation for the chartsheet level class is found
in L
B<3.> This package supports reading xlsm files (Macro enabled Excel 2007+ workbooks).
xlsm files allow for binaries to be embedded that may contain malicious code. However, other
than unzipping the excel file no work is done by this package with the sub-file 'vbaProject.bin'
containing the binaries. This package does not provide an API to that sub-file and I have no
intention of doing so. Therefore my research indicates there should be no risk of virus activation
while parsing even an infected xlsm file with this package but I encourage you to use your own
judgement in this area. B<L<caveat utilitor!|https://en.wiktionary.org/wiki/Appendix:List_of_Latin_phrases>>
B<4.> This package will read some files with 'broken' xml. In general this should be
transparent but in the case of the maximum row value and the maximum column value for a
worksheet it can cause some surprising problems. This includes the possibility that the maximum
values are initially stored as 'undef' if the sheet does not provide them in the metadata as
expected. The answer to the methods L<Spreadsheet::XLSX::Reader::LibXML::Worksheet/row_range> and
L<Spreadsheet::XLSX::Reader::LibXML::Worksheet/col_range> will then change as more of the sheet is
parsed. The parser improves these values as information is available based on the dimensional
scope of the users cell parsing. These values are generally never available in Excel 2003 xml files.
The primary cause of these broken XML elements in Excel 2007+ files are non-XML applications writing
to the excel spreadsheet. You can use the attribute L<file_boundary_flags|/file_boundary_flags> or
the methods L<Spreadsheet::XLSX::Reader::LibXML::Worksheet/get_next_value> or
L<Spreadsheet::XLSX::Reader::LibXML::Worksheet/fetchrow_arrayref> as alternates for pre-testing
for boundaries when iterating.
B<5.> Version v0.40.2 changes the way file caching is turned on and off. It also changes the way it is set when starting an instance of this package. If you did not turn off caching explicitly before this release there should no be a problem with this change. The goal is to automatically differentiate large files and small files and L<turn caching off|/cache_positions> in a targeted manner in response to larger file sizes. This should allow larger spreadsheets that may have exceeded available RAM to run (slowly) when they didn't run at all before without forcing small sheets to run too much slower. However, if you do have caching turned off in your code using the old Boolean setting this package will now see it, fix it upon load, and emit a warning. I will still be tweaking this setting over the next few releases. This warning will stay till 3/1/2017 and then the old callout will no longer be supported.
B<6.> Version v0.40.2 introduces the L<file|/file> attribute and will start the deprication of the L<file_name|/file_name> and L<file_handle|/file_handle> attributes as well as the following methods: L<set_file_name|/set_file_name>, L<set_file_handle|/set_file_handle>, L<file_handle|/file_handle>, and L<has_file_handle|/has_file_handle>. This change is intended to remove an overly complex set of dependancies that was causing trouble for garbage collection on cleanup. Please use the L<file|/file> attribute and the L<file_opened|/file_opened> methods as replacements moving forward. Support for backwards compatible use of the old attributes and methods will be removed after 3/1/2017.
B<7.> Version v0.40.2 introduces support for L<SpreadsheetML |https://odieweblog.wordpress.com/2012/02/12/how-to-read-and-write-office-2003-excel-xml-files/> (Excel 2003) .xml extention documents. These documents should include
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
somewhere in the header to indicate their intended format. This change does introduce a lot of
behind the scenes re-plumbing but the top level tests all stayed the same. This means that for
.xlsx and .xlsm extentions there should not be any obvious changes or (hopefully) significant
new bugs. I<Note warnings 5 and 6>. However, to get this release out and rolling I don't have a
full set of tests for the .xml extention paths and Microsofts documentation for that format is
spotty in some relevant areas (I still don't know what I don't know) so please L<submit
|https://github.com/jandrew/Spreadsheet-XLSX-Reader-LibXML/issues> any cases that appear to behave
differently than expected for .xml extention files that are readable by the Excel application.
I am also interested in cases where an out of memory error occurs with an .xml extension file. This
warning will stay till 3/1/2017.
=head2 Attributes
Data passed to new when creating an instance. For modification of these attributes see the
listed 'attribute methods'. For general information on attributes see
L
B
$workbook_instance = Spreadsheet::XLSX::Reader::LibXML->new( %attributes )
I<note: if the file information is not included in the initial %attributes then it must be
set by one of the attribute setter methods below or the L<parse
|parse( $file_nameE
=head3 file_name
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 file_handle
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 error_inst
=over
B
B
( should_warn => 0 )
B
error set_error clear_error set_warnings if_warn
The error instance must be able to extract the error string from a passed error object as well. For now the current implementation will attempt ->as_string first and then ->message if an object is passed.
B
=over
B
=over
B
=back
B
=over
B
=back
B
=over
B
=back
B
=over
B
=back
B
=over
B
=back
B
=over
B
=back
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 sheet_parser
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 count_from_zero
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 file_boundary_flags
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 empty_is_end
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 values_only
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 from_the_edge
=over
B
B
B
B
=over
B
=over
B
=back
=back
=back
=head3 cache_positions
=over
B
Currently four of the files can implement selective caching. The setting for
this attribute takes a hash ref with the file indicators as keys and the max
file size in bytes as the value. When the sub file handle exceeds that size
then caching for that subfile is turned off. The default setting shows an
example with the four available cached size.
B
B
{
sharedStrings => 5242880,# 5 MB
styles => 5242880,# 5 MB
worksheet_interface => 5242880,# 5 MB
chartsheet_interface => 5242880,# 5 MB
}
B
=over
B
=over
B
=back
B<get_cache_size( $target_file )>
=over
B
=back
B<set_cache_size( $target_file => $max_file_size )>
=over
B
=back
B<has_cache_size( $target_file )>
=over
B
=back
=back
=back
=head3 formatter_inst
=over
B
B
B
=over
B<set_formatter_inst( $instance )>
=over
B
=back
B
=over
B
=back
=back
B
delegated_to => link_delegated_from
=over
get_formatter_region => L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/get_excel_region>
get_target_encoding => L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/get_target_encoding>
set_target_encoding => L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/set_target_encoding( $encoding )>
has_target_encoding => L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/has_target_encoding>
change_output_encoding => L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/change_output_encoding( $string )>
set_defined_excel_formats => L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/set_defined_excel_formats( %args )>
get_defined_conversion => L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/get_defined_conversion( $position )>
parse_excel_format_string => L<Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings/parse_excel_format_string( $string, $name )>
set_date_behavior => L<Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings/set_date_behavior( $Bool )>
set_european_first => L<Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings/set_european_first( $Bool )>
set_formatter_cache_behavior => L<Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings/set_cache_behavior( $Bool )>
=back
=back
=head3 group_return_type
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head3 empty_return_type
=over
B
B
B
B
=over
B
=over
B
=back
B
=over
B
=back
=back
=back
=head2 Primary Methods
These are the primary ways to use this class. They can be used to open an .xlsx workbook.
They are also ways to investigate information at the workbook level. For information on
how to retrieve data from the worksheets see the
L<Worksheet|Spreadsheet::XLSX::Reader::LibXML::Worksheet> and
L<Cell|Spreadsheet::XLSX::Reader::LibXML::Cell> documentation. For additional workbook
options see the L<Secondary Methods|/Secondary Methods>
and the L<Attributes|/Attributes> sections. The attributes section specifically contains
all the methods used to adjust the attributes of this class.
All methods are object methods and should be implemented on the object instance.
B
my @worksheet_array = $workbook_instance->worksheets;
=head3 parse( $file_name|$file_handle, $formatter )
=over
B
It only works if the L<file_name|/file_name> or L<file_handle|/file_handle> attribute was not
set with ->new. It is one way to set the 'file_name' or 'file_handle' attribute [and the
L<default_format_list|/default_format_list> attribute]. I<You cannot pass both a file name
and a file handle simultaneously to this method.>
B
$file = a valid xlsx file [or a valid xlsx file handle] (required)
[$formatter] = see the default_format_list attribute for valid options (optional)
B
=back
=head3 worksheets
=over
B
For alternatives see the L<get_worksheet_names|/get_worksheet_names> method and the
L<worksheet|/worksheet( $name )> methods. B<For now it also only returns the tabular
worksheets in the workbook. All chart worksheets are ignored! (future inclusion will
included a backwards compatibility policy)>
B
B
=back
=head3 worksheet( $name )
=over
B
If no value is passed to $name then the 'next' worksheet in physical order is
returned. I<'next' will NOT wrap> It also only iterates through the 'worksheets'
in the workbook (but not the 'chartsheets').
B
B
B
while( my $worksheet = $workbook->worksheet ){
print "Reading: " . $worksheet->name . "\n";
# get the data needed from this worksheet
}
=back
=head3 in_the_list
=over
B
B
B
=back
=head3 start_at_the_beginning
=over
B
B
B
=back
=head3 worksheet_count
=over
B
B
B
=back
=head3 get_worksheet_names
=over
B
B
B
B
for $sheet_name ( @{$workbook->worksheet_names} ){
my $worksheet = $workbook->worksheet( $sheet_name );
# Read the worksheet here
}
=back
=head3 get_sheet_names
=over
B
B
B
=back
=head3 get_chartheet_names
=over
B
B
B
=back
=head3 sheet_name( $Int )
=over
B
B
B
B
for $x (2..4){
my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
# Read the worksheet here
}
=back
=head3 sheet_count
=over
B
B
B
=back
=head3 worksheet_name( $Int )
=over
B
B
B
B
for $x (2..4){
my $worksheet = $workbook->worksheet( $workbook->worksheet_name( $x ) );
# Read the worksheet here
}
=back
=head3 worksheet_count
=over
B
B
B
=back
=head3 chartsheet_name( $Int )
=over
B
B
B
=back
=head3 chartsheet_count
=over
B
B
B
=back
=head3 error
=over
B
B
B
=back
=head2 Secondary Methods
These are the additional methods that include ways to extract additional information about
the .xlsx file and ways to modify workbook and worksheet parsing that are less common.
Note that all methods specifically used to adjust workbook level attributes are listed in
the L<Attribute|/Attribute> section. This section primarily contains methods for or
L<delegated|Moose::Manual::Delegation> from private attributes set up during the workbook
load process.
=head3 parse_excel_format_string( $format_string )
=over
Roundabout delegation from L<Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings/parse_excel_format_string( $string )>
=back
=head3 creator
=over
B
B
B
=back
=head3 date_created
=over
B
B
B
=back
=head3 modified_by
=over
B
B
B
=back
=head3 date_modified
=over
B
B
B
=back
=head3 get_epoch_year
=over
B
B
B
=back
=head3 get_shared_string
=over
Roundabout delegation from L<Spreadsheet::XLSX::Reader::LibXML::SharedStrings/get_shared_string( $position )>
=back
=head3 get_format_position
=over
Roundabout delegation from L<Spreadsheet::XLSX::Reader::LibXML::Styles/get_format_position( $position, [$header] )>
=back
=head3 set_defined_excel_format_list
=over
Roundabout delegation from L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/set_defined_excel_format_list>
=back
=head3 change_output_encoding
=over
Roundabout delegation from L<Spreadsheet::XLSX::Reader::LibXML::FmtDefault/change_output_encoding( $string )>
=back
=head3 set_cache_behavior
=over
Roundabout delegation from L<Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings/cache_formats>
=back
=head3 get_date_behavior
=over
Roundabout delegation from L<Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings/datetime_dates>
=back
=head3 set_date_behavior
=over
Roundabout delegation from L<Spreadsheet::XLSX::Reader::LibXML::ParseExcelFormatStrings/datetime_dates>
=back
=head1 FLAGS
The parameter list (attributes) that are possible to pass to ->new is somewhat long.
Therefore you may want a shortcut that aggregates some set of attribute settings that
are not the defaults but wind up being boilerplate. I have provided possible
alternate sets like this and am open to providing others that are suggested. The
flags will have a : in front of the identifier and will be passed to the class in the
'use' statement for consumption by the import method. The flags can be stacked and
where there is conflict between the flag settings the rightmost passed flag setting is
used.
Example;
use Spreadsheet::XLSX::Reader::LibXML v0.34.4 qw( :alt_default :debug );
=head2 :alt_default
This is intended for a deep look at data and skip formatting cells.
=over
B
=over
L<values_only|/values_only> => 1
L<count_from_zero|/count_from_zero> => 0
L<empty_is_end|/empty_is_end> => 1
=back
=back
=head2 :just_the_data
This is intended for a shallow look at data and skip formatting.
=over
B
=over
L<values_only|/values_only> => 1
L<count_from_zero|/count_from_zero> => 0
L<empty_is_end|/empty_is_end> => 1
L<group_return_type|/group_return_type> => 'value'
L<cache_positions|/cache_positions> => 1
L<from_the_edge|/from_the_edge> => 0,
=back
=back
=head2 :just_raw_data
This is intended for a shallow look at raw text and skips all formatting including number formats.
=over
B
=over
L<values_only|/values_only> => 1
L<count_from_zero|/count_from_zero> => 0
L<empty_is_end|/empty_is_end> => 1
L<group_return_type|/group_return_type> => 'unformatted'
L<cache_positions|/cache_positions> => 1
L<from_the_edge|/from_the_edge> => 0,
=back
=back
=head2 :debug
Turn on L<Spreadsheet::XLSX::Reader::LibXML::Error/should_warn> in the Error attribute (instance)
=over
B
=over
L<Spreadsheet::XLSX::Reader::LibXML::Error/should_warn> => 1
=back
=back
=head1 BUILD / INSTALL from Source
B<0.> Please note that using L<cpanm|https://metacpan.org/pod/App::cpanminus> is much easier than a source build! (but it will not always give the latest github version)
cpanm Spreadsheet::XLSX::Reader::LibXML
And then if you feel kindly
cpanm-reporter
B<1.> This package uses L
=over
B
C:\> cpanm PkgConfig --configure-args=--script=pkg-config
=back
It may be that you still need to use a system package manager to L<load|http://xmlsoft.org/> the 'libxml2-devel' library. If this is the case or you experience any other installation issues please L<submit them to github|https://github.com/jandrew/Spreadsheet-XLSX-Reader-LibXML/issues> especially if they occur prior to starting the test suit as these failures will not auto push from CPAN Testers so I won't know to fix them!
B<2.> Download a compressed file with this package code from your favorite source
=over
L<github|https://github.com/jandrew/Spreadsheet-XLSX-Reader-LibXML>
L<Meta::CPAN|https://metacpan.org/pod/Spreadsheet::XLSX::Reader::LibXML>
L<CPAN|http://search.cpan.org/~jandrew/Spreadsheet-XLSX-Reader-LibXML/>
=back
B<3.> Extract the code from the compressed file.
=over
If you are using tar on a .tar.gz file this should work:
tar -zxvf Spreadsheet-XLSX-Reader-LibXML-v0.xx.tar.gz
=back
B<4.> Change (cd) into the extracted directory
B<5.> Run the following
=over
(for Windows find what version of make was used to compile your perl)
perl -V:make
(then for Windows substitute the correct make function (s/make/dmake/g)? below)
=back
perl Makefile.PL
make
make test
make install # As sudo/root
make clean
=head1 SUPPORT
=over
L<github Spreadsheet::XLSX::Reader::LibXML/issues|https://github.com/jandrew/Spreadsheet-XLSX-Reader-LibXML/issues>
=back
=head1 TODO
=over
B<1.> Add POD for all the new chart methods!
B<1.> Build an 'Alien::LibXML::Devel' package to load the libxml2-devel libraries from source and
require that and L
=over
Both libxml2 and libxml2-devel libraries are required for XML::LibXML
=back
B<1.> Add an individual test just for Spreadsheet::XLSX::Reader::LibXML::Row (Currently tested in the worksheet test)
B<2.> Add an individual test just for Spreadsheet::XLSX::Reader::LibXML::ZipReader (Currently only tested in the top level test)
B<3.> Add individual tests just for the File, Meta, Props, Rels sub workbook interfaces
B<4.> Add an individual test just for Spreadsheet::XLSX::Reader::LibXML::ZipReader::ExtractFile
B<5.> Add individual tests just for the XMLReader sub modules NamedStyles, and PositionStyles
B<6.> Add a pivot table reader (Not just read the values from the sheet)
B<7.> Add calc chain methods
B<8.> Add more exposure to workbook/worksheet formatting values
B<9.> Build a DOM parser alternative for the sheets
=over
(Theoretically faster than the reader and no longer JIT so it uses more memory)
=back
=back
=head1 AUTHOR
=over
Jed Lund
jandrew@cpan.org
=back
=head1 CONTRIBUTORS
This is the (likely incomplete) list of people who have helped make this distribution what it is, either via code contributions, patches, bug reports, help with troubleshooting, etc. A huge 'thank you' to all of them.
=over
L<Frank Maas|https://github.com/Frank071>
L<Stuart Watt|https://github.com/morungos>
L<Toby Inkster|https://github.com/tobyink>
L<Breno G. de Oliveira|https://github.com/garu>
L<Bill Baker|https://github.com/wdbaker54>
L<H.Merijin Brand|https://github.com/Tux>
L<Todd Eigenschink|mailto:todd@xymmetrix.com>
L<Alexandr Ciornii|http://chorny.net>
=back
=head1 COPYRIGHT
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
This software is copyrighted (c) 2014, 2016 by Jed Lund
=head1 DEPENDENCIES
=over
L<perl 5.010|perl/5.10.0>
L
L
L
L
L
L
L
L
L
L
L
L
L
L
=back
=head1 SEE ALSO
=over
L
L
L
L
L<Log::Shiras|https://github.com/jandrew/Log-Shiras>
=over
All lines in this package that use Log::Shiras are commented out
=back
=back
=cut