plazi / ggxml2taxpub

Conversion of GoldenGATE XML to JATS/TaxPub at treatment level

run conversion on 500 article sample #73

Open tcatapano opened 4 months ago

tcatapano commented 4 months ago

The sample consists of the latest 500 processed articles.

tcatapano commented 4 months ago

Removed 1 large (100M+) file, leaving 499 articles.

Errors in 63 of the 499 articles (12.6%).

tcatapano commented 4 months ago

Top errors by frequency:

1480  Unexpected element "tp:material-citation". The content of the parent element type must match "(named-content|tp
 133  Unexpected element "tp:material-citation". The content of the parent element type must match "(tp
  18  Unexpected character data "
  13  Unexpected element "p". The content of the parent element type must match "(email|ext-link|uri|inline-supplementary-material|related-article|related-object|address|alternatives|answer|answer-set|array|block-alternatives|boxed-text|chem-struct-wrap|code|explanation|fig|fig-group|graphic|media|preformat|question|question-wrap|question-wrap-group|supplementary-material|table-wrap|table-wrap-group|disp-formula|disp-formula-group|citation-alternatives|element-citation|mixed-citation|nlm-citation|bold|fixed-case|italic|monospace|overline|roman|sans-serif|sc|strike|underline|ruby|award-id|funding-source|open-access|chem-struct|inline-formula|inline-graphic|inline-media|private-char|def-list|list|tex-math|mml:math|abbrev|milestone-end|milestone-start|named-content|styled-content|tp
  11  The content of element type "kwd-group" is incomplete, it must match "(label?,title?,(kwd|compound-kwd|nested-kwd)+)".
   9  Unexpected element "title". The content of the parent element type must match "(sec-meta?,((label,title?)|title),(address|alternatives|answer|answer-set|array|block-alternatives|boxed-text|chem-struct-wrap|code|explanation|fig|fig-group|graphic|media|preformat|question|question-wrap|question-wrap-group|supplementary-material|table-wrap|table-wrap-group|disp-formula|disp-formula-group|def-list|list|tex-math|mml:math|p|related-article|related-object|disp-quote|speech|statement|verse-group)*,(sec|tp
   9  The markup in the document following the root element must be well-formed.
   9  Attribute "xmlns:tp" is not allowed to appear in element "journal-meta".
   7  Unexpected element "sec". The content of the parent element type must match "(sec-meta?,label?,title?,(address|alternatives|answer|answer-set|array|block-alternatives|boxed-text|chem-struct-wrap|code|explanation|fig|fig-group|graphic|media|preformat|question|question-wrap|question-wrap-group|supplementary-material|table-wrap|table-wrap-group|disp-formula|disp-formula-group|def-list|list|tex-math|mml:math|p|related-article|related-object|disp-quote|speech|statement|verse-group)*,tp
   5  Unexpected element "tp:treatment-sec". The content of the parent element type must match "((address|alternatives|answer|answer-set|array|block-alternatives|boxed-text|chem-struct-wrap|code|explanation|fig|fig-group|graphic|media|preformat|question|question-wrap|question-wrap-group|supplementary-material|table-wrap|table-wrap-group|disp-formula|disp-formula-group|def-list|list|tex-math|mml
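A tally like the one above can be reproduced with a short script. This is a minimal sketch, not the method used for this issue: it assumes the validator writes one message per line, and the function name `top_errors`, the `width` cutoff, and the sample lines are all hypothetical. Cutting each message to a fixed prefix groups messages that differ only in their long content-model tails.

```python
from collections import Counter


def top_errors(lines, width=120, n=10):
    """Count messages, truncated to `width` characters, most frequent first."""
    counts = Counter()
    for line in lines:
        msg = line.strip()
        if not msg:
            continue  # skip blank lines
        counts[msg[:width]] += 1
    return counts.most_common(n)


# Hypothetical log excerpt for illustration, not taken from the actual run.
sample = [
    'Unexpected element "tp:material-citation".',
    'Unexpected element "tp:material-citation".',
    'Attribute "xmlns:tp" is not allowed to appear in element "journal-meta".',
]

for msg, count in top_errors(sample):
    print(f"{count:5d}  {msg}")
```

To run it on a real log, pass an open file object instead of `sample`; the file name and format would need to match whatever the validator actually produces.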
tcatapano commented 4 months ago

Top error counts per file:

See: https://github.com/plazi/ggxml2taxpub/blob/dd62e2ade37250a1a28cd586415d28bea4bc01ec/errs/sample_500_errors_20240312_per-article.txt

 620 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/BiodivDatJour_12__e117362_tp.xml
 540 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/BiodivDatJour_12__e115051_tp.xml
 319 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/BiodivDatJour_12__e120292_tp.xml
  42 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/LinzerbiolBeitr.55.2.455-494.pdf_tp.xml
  19 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/SoilOrg.92.1.15-86.pdf_tp.xml
  17 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/LinzerbiolBeitr.55.1.9-35.pdf_tp.xml
  16 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/AmMusNovit.2024.4009.1-47.pdf_tp.xml
  12 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/SoilOrg.91.1.7-32.pdf_tp.xml
   9 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/InsectaMundi.2024.1036.1-31.pdf_tp.xml
   8 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/ZoolAnz.56.264-281.imf_tp.xml
   8 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/SoilOrg.95.1.23-73.pdf_tp.xml
   8 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/InsectaMundi.2024.1037.1-16.pdf_tp.xml
   7 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/LinzerbiolBeitr.55.1.307-349.pdf_tp.xml
   6 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/SoilOrg.90.2.57-70.pdf_tp.xml
   6 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/LinzerbiolBeitr.55.1.47-60.pdf_tp.xml
   4 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/PakistanJNematol.37.2.171-243.pdf.imf_tp.xml
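The per-file counts above could be derived along these lines. This is a sketch under assumptions: it supposes the error log repeats a `System ID: <path>` line once per reported error, which is inferred from the listing rather than confirmed, and `errors_per_file` and the sample paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def errors_per_file(lines, marker="System ID:"):
    """Count lines that start with `marker`, keyed by the file name they name."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith(marker):
            path = line[len(marker):].strip()
            counts[PurePosixPath(path).name] += 1
    return counts.most_common()  # highest count first


# Hypothetical log lines for illustration only.
sample_log = [
    "System ID: /tmp/sample/a_tp.xml",
    'Unexpected element "p".',
    "System ID: /tmp/sample/a_tp.xml",
    "System ID: /tmp/sample/b_tp.xml",
]

for name, count in errors_per_file(sample_log):
    print(f"{count:4d} {name}")
```

Keying on the file name rather than the full path keeps the output short; if two directories held files with the same name, the full path would be the safer key.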
tcatapano commented 4 months ago

The sample contains several small files that are likely metadata only:

8.0K    /Users/thc4/working/articles_sample_500/phytotaxa.637.3.10.pdf.xml
8.0K    /Users/thc4/working/articles_sample_500/CheckList_20_2___444-449.xml
8.0K    /Users/thc4/working/articles_sample_500/CheckList_20_2___279-443.xml
8.0K    /Users/thc4/working/articles_sample_500/CheckList_20_2___268-278.xml
8.0K    /Users/thc4/working/articles_sample_500/CheckList_20_2___258-267.xml
8.0K    /Users/thc4/working/articles_sample_500/CheckList_20_2___249-257.xml
8.0K    /Users/thc4/working/articles_sample_500/CheckList_20_1___242-248.xml
8.0K    /Users/thc4/working/articles_sample_500/CheckList_20_1___233-241.xml
8.0K    /Users/thc4/working/articles_sample_500/CheckList_20_1___227-232.xml
4.0K    /Users/thc4/working/articles_sample_500/EJT.2024.923.1-119.pdf.imf.xml
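One way to flag such files before running the conversion is to filter by size. This is a sketch, not part of the repository: `small_files`, the 8192-byte threshold, and the `*.xml` pattern are assumptions chosen to roughly match the listing above. The sizes in that listing look like disk usage rounded up to whole blocks, whereas `st_size` is the exact byte count, so the two will not agree exactly.

```python
from pathlib import Path


def small_files(directory, max_bytes=8192, pattern="*.xml"):
    """Return (size, path) pairs for files of at most `max_bytes`, smallest first."""
    hits = []
    for path in Path(directory).glob(pattern):
        if path.is_file():
            size = path.stat().st_size  # exact size in bytes
            if size <= max_bytes:
                hits.append((size, path))
    return sorted(hits)


if __name__ == "__main__":
    # Scans the current directory; point it at the sample directory as needed.
    for size, path in small_files("."):
        print(f"{size:8d}  {path}")
```

A size threshold is only a heuristic. Whether a flagged file really is metadata only would still need to be confirmed by inspecting its content.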