simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

xml validation fails #244

Open simsong opened 2 years ago

simsong commented 2 years ago
(base) simsong@nimi src % xmllint --valid out-emails1/report.xml | head -10
out-emails1/report.xml:2: validity error : Validation failed: no DTD found !
<dfxml xmloutputversion='1.0'>
                             ^
<?xml version="1.0" encoding="UTF-8"?>
<dfxml xmloutputversion="1.0">
  <metadata xmlns="http://afflib.org/bulk_extractor/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:type>Feature Extraction</dc:type>
  </metadata>
  <creator version="1.0">
    <program>BULK_EXTRACTOR</program>
    <version>2.0.0-dev</version>
    <build_environment>
      <compiler>4.2.1 (Apple LLVM 12.0.5 (clang-1205.0.22.11))</compiler>
(base) simsong@nimi src %
simsong commented 2 years ago

Apparently I need a DTD. Perhaps @ajnelson-nist can help.

ajnelson-nist commented 2 years ago

The DFXML schema can be used to validate DFXML, though it needs to use the --schema flag, not the --valid flag. The Python code base's samples Makefile demonstrates this. I would recommend tracking the schema as a Git submodule, at the version where you want it to validate.
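As a rough sketch of the `--schema` invocation, assuming `xmllint` is on the PATH: the helper names `xmllint_schema_command` and `validates` are hypothetical, and `dfxml.xsd` stands for wherever the schema checkout lives.

```python
import subprocess


def xmllint_schema_command(schema_path, xml_path):
    """Build the argv for XSD validation (--schema), not DTD validation (--valid)."""
    return ["xmllint", "--noout", "--schema", schema_path, xml_path]


def validates(schema_path, xml_path):
    """Return True when xmllint exits with status 0 for this schema/document pair."""
    result = subprocess.run(
        xmllint_schema_command(schema_path, xml_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Paths are placeholders; point them at the schema checkout and a real report.
    print(" ".join(xmllint_schema_command("dfxml.xsd", "out-emails1/report.xml")))
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to log or test the exact command without needing `xmllint` installed.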

You may also be in for a bit of a data upgrade, as the DFXML schema identified many long-standing issues with the way DFXML was originally drafted. For one thing, namespaces are scoped to the element they're attached to, so your sample has no namespace to which it's claiming to conform. See Differencing test 0 for how to declare a <dfxml> element as in the DFXML namespace.
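To illustrate the scoping point, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The first document is trimmed from the sample above. The second assumes the DFXML namespace URI is `http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML`, so check that against the schema version being tracked.

```python
import xml.etree.ElementTree as ET

# xmlns is declared on <metadata>, so the enclosing <dfxml> has no namespace.
ORIGINAL = """<dfxml xmloutputversion="1.0">
  <metadata xmlns="http://afflib.org/bulk_extractor/"
            xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:type>Feature Extraction</dc:type>
  </metadata>
</dfxml>"""

# Assumed DFXML namespace URI; verify against the schema you validate with.
DFXML_NS = "http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML"

# Declaring the default namespace on the root puts <dfxml> and its
# unprefixed descendants into that namespace.
FIXED = f"""<dfxml version="1.0" xmlns="{DFXML_NS}"
       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:type>Feature Extraction</dc:type>
  </metadata>
</dfxml>"""

original_root = ET.fromstring(ORIGINAL)
fixed_root = ET.fromstring(FIXED)

# ElementTree spells a namespaced tag as "{namespace-uri}localname".
print(original_root.tag)     # dfxml
print(original_root[0].tag)  # {http://afflib.org/bulk_extractor/}metadata
print(fixed_root.tag == "{" + DFXML_NS + "}dfxml")  # True
```

The unqualified `dfxml` tag in the first case is why a namespace-aware schema has nothing to match the root element against.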

simsong commented 2 years ago

Well, you are now the XML/DFXML expert. If you could give me a sample of how to add the namespace or other scoping tags, I'll update bulk_extractor 2.0 so that it produces conformant DFXML.

simsong commented 2 years ago

@ajnelson-nist - I think that I'm making progress on this. The remaining validation errors apparently require that I either update the DFXML schema or create my own namespace.

Here is the new head of the DFXML output of bulk_extractor:

<?xml version='1.0' encoding='UTF-8'?>
<dfxml version='1.0' xmlns='http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML'
  xmlns:debug='http://afflib.org/bulk_extractor/debug'
  xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
  xmlns:dc='http://purl.org/dc/elements/1.1/'>
  <metadata>
    <dc:type>Feature Extraction</dc:type>
  </metadata>
  <creator version='1.0'>
    <program>BULK_EXTRACTOR</program>
    <version>2.0.0-dev</version>
    <build_environment>
...

And here is what happens when I try to validate it:

% xmllint --noout --schema dfxml.xsd out-domexusers-be20v3/report.xml
warning: failed to load external entity "ref/dc.xsd"
dfxml.xsd:34: element import: Schemas parser warning : Element '{http://www.w3.org/2001/XMLSchema}import': Failed to locate a schema at location 'ref/dc.xsd'. Skipping the import.
warning: failed to load external entity "ref/xml.xsd"
dfxml.xsd:43: element import: Schemas parser warning : Element '{http://www.w3.org/2001/XMLSchema}import': Failed to locate a schema at location 'ref/xml.xsd'. Skipping the import.
out-domexusers-be20v3/report.xml:14: element CPPFLAGS: Schemas validity error : Element '{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}CPPFLAGS': This element is not expected. Expected is one of ( {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}compilation_date, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}library ).
out-domexusers-be20v3/report.xml:25: element cpuid: Schemas validity error : Element '{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}cpuid': This element is not expected.
out-domexusers-be20v3/report.xml:49: element configuration: Schemas validity error : Element '{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}configuration': This element is not expected. Expected is one of ( {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}source, ##other{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}*, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}diskimageobject, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}partitionsystemobject, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}partitionobject, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}volume, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}fileobject, {http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}rusage, ##other{http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML}* ).
out-domexusers-be20v3/report.xml fails to validate
%
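For triage, the validity errors above can be boiled down to a list of the elements the schema rejected. This is only a sketch: the regular expression is inferred from the three error lines in this one run, not from any documented xmllint output format, and the function name `unexpected_elements` is made up. The "Expected is one of" lists are trimmed from the sample for brevity.

```python
import re

# Pattern inferred from the output above: "<path>:<line>: element <name>: Schemas validity error : <message>"
ERROR_RE = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+): element (?P<name>[^:]+): "
    r"Schemas validity error : (?P<message>.*)$"
)


def unexpected_elements(xmllint_output):
    """Return (line_number, element_name) for each 'not expected' validity error."""
    found = []
    for raw in xmllint_output.splitlines():
        match = ERROR_RE.match(raw)
        if match and "This element is not expected" in match.group("message"):
            found.append((int(match.group("line")), match.group("name")))
    return found


NS = "http://www.forensicswiki.org/wiki/Category:Digital_Forensics_XML"
SAMPLE = "\n".join([
    'warning: failed to load external entity "ref/dc.xsd"',
    f"out-domexusers-be20v3/report.xml:14: element CPPFLAGS: Schemas validity error : Element '{{{NS}}}CPPFLAGS': This element is not expected.",
    f"out-domexusers-be20v3/report.xml:25: element cpuid: Schemas validity error : Element '{{{NS}}}cpuid': This element is not expected.",
    f"out-domexusers-be20v3/report.xml:49: element configuration: Schemas validity error : Element '{{{NS}}}configuration': This element is not expected.",
    "out-domexusers-be20v3/report.xml fails to validate",
])

print(unexpected_elements(SAMPLE))
# [(14, 'CPPFLAGS'), (25, 'cpuid'), (49, 'configuration')]
```

Each entry is an element that either needs to move into a separate namespace or be added to the schema.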

I guess dc: is Dublin Core, so I will need to get a Dublin Core xsd file somewhere.

I'm not sure what xsi: is about. Any clue?