Open tcatapano opened 8 months ago
Removed 1 large (100M+) file.
Errors in 63 of 499 articles (12.6%).
Top errors by frequency:
1480 Unexpected element "tp:material-citation". The content of the parent element type must match "(named-content|tp
133 Unexpected element "tp:material-citation". The content of the parent element type must match "(tp
18 Unexpected character data "
13 Unexpected element "p". The content of the parent element type must match "(email|ext-link|uri|inline-supplementary-material|related-article|related-object|address|alternatives|answer|answer-set|array|block-alternatives|boxed-text|chem-struct-wrap|code|explanation|fig|fig-group|graphic|media|preformat|question|question-wrap|question-wrap-group|supplementary-material|table-wrap|table-wrap-group|disp-formula|disp-formula-group|citation-alternatives|element-citation|mixed-citation|nlm-citation|bold|fixed-case|italic|monospace|overline|roman|sans-serif|sc|strike|underline|ruby|award-id|funding-source|open-access|chem-struct|inline-formula|inline-graphic|inline-media|private-char|def-list|list|tex-math|mml:math|abbrev|milestone-end|milestone-start|named-content|styled-content|tp
11 The content of element type "kwd-group" is incomplete, it must match "(label?,title?,(kwd|compound-kwd|nested-kwd)+)".
9 Unexpected element "title". The content of the parent element type must match "(sec-meta?,((label,title?)|title),(address|alternatives|answer|answer-set|array|block-alternatives|boxed-text|chem-struct-wrap|code|explanation|fig|fig-group|graphic|media|preformat|question|question-wrap|question-wrap-group|supplementary-material|table-wrap|table-wrap-group|disp-formula|disp-formula-group|def-list|list|tex-math|mml:math|p|related-article|related-object|disp-quote|speech|statement|verse-group)*,(sec|tp
9 The markup in the document following the root element must be well-formed.
9 Attribute "xmlns:tp" is not allowed to appear in element "journal-meta".
7 Unexpected element "sec". The content of the parent element type must match "(sec-meta?,label?,title?,(address|alternatives|answer|answer-set|array|block-alternatives|boxed-text|chem-struct-wrap|code|explanation|fig|fig-group|graphic|media|preformat|question|question-wrap|question-wrap-group|supplementary-material|table-wrap|table-wrap-group|disp-formula|disp-formula-group|def-list|list|tex-math|mml:math|p|related-article|related-object|disp-quote|speech|statement|verse-group)*,tp
5 Unexpected element "tp:treatment-sec". The content of the parent element type must match "((address|alternatives|answer|answer-set|array|block-alternatives|boxed-text|chem-struct-wrap|code|explanation|fig|fig-group|graphic|media|preformat|question|question-wrap|question-wrap-group|supplementary-material|table-wrap|table-wrap-group|disp-formula|disp-formula-group|def-list|list|tex-math|mml
Top error counts per file:
620 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/BiodivDatJour_12__e117362_tp.xml
540 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/BiodivDatJour_12__e115051_tp.xml
319 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/BiodivDatJour_12__e120292_tp.xml
42 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/LinzerbiolBeitr.55.2.455-494.pdf_tp.xml
19 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/SoilOrg.92.1.15-86.pdf_tp.xml
17 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/LinzerbiolBeitr.55.1.9-35.pdf_tp.xml
16 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/AmMusNovit.2024.4009.1-47.pdf_tp.xml
12 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/SoilOrg.91.1.7-32.pdf_tp.xml
9 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/InsectaMundi.2024.1036.1-31.pdf_tp.xml
8 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/ZoolAnz.56.264-281.imf_tp.xml
8 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/SoilOrg.95.1.23-73.pdf_tp.xml
8 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/InsectaMundi.2024.1037.1-16.pdf_tp.xml
7 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/LinzerbiolBeitr.55.1.307-349.pdf_tp.xml
6 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/SoilOrg.90.2.57-70.pdf_tp.xml
6 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/LinzerbiolBeitr.55.1.47-60.pdf_tp.xml
4 System ID: /Users/thc4/Github/ggxml2taxpub/level1/articles/sample_500/PakistanJNematol.37.2.171-243.pdf.imf_tp.xml
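
For anyone who wants to reproduce the two tallies above, here is a minimal sketch in Python. The issue does not show the raw log format, so this assumes each error is reported as a `System ID: <path>` line followed by a `Description: <message>` line; the `Description:` prefix and the function names `tally_errors` and `error_rate` are hypothetical and would need adapting to the real validator output.

```python
from collections import Counter


def tally_errors(log_lines):
    """Count errors per message and per file.

    Assumed (hypothetical) log layout: a 'System ID: <path>' line names the
    file, and each following 'Description: <message>' line is one error.
    """
    by_message = Counter()
    by_file = Counter()
    current_file = None
    for raw in log_lines:
        line = raw.strip()
        if line.startswith("System ID:"):
            current_file = line[len("System ID:"):].strip()
        elif line.startswith("Description:"):
            message = line[len("Description:"):].strip()
            by_message[message] += 1
            if current_file is not None:
                by_file[current_file] += 1
    return by_message, by_file


def error_rate(files_with_errors, total_files):
    """Percentage of files with at least one error, to one decimal place."""
    return round(100 * files_with_errors / total_files, 1)


if __name__ == "__main__":
    # Tiny made-up log, only to show the call pattern.
    demo = [
        "System ID: a_tp.xml",
        'Description: Unexpected element "p".',
        "System ID: a_tp.xml",
        'Description: Unexpected element "p".',
        "System ID: b_tp.xml",
        'Description: Unexpected element "sec".',
    ]
    messages, files = tally_errors(demo)
    for message, count in messages.most_common(10):
        print(count, message)
    for path, count in files.most_common(10):
        print(count, "System ID:", path)
    print(error_rate(63, 499))  # 12.6, matching the figure reported above
```

Grouping by the full message text keeps the two `tp:material-citation` variants separate, as in the list above; truncating messages before counting would merge them.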
The sample contains several small files that are likely metadata only:
8.0K /Users/thc4/working/articles_sample_500/phytotaxa.637.3.10.pdf.xml
8.0K /Users/thc4/working/articles_sample_500/CheckList_20_2___444-449.xml
8.0K /Users/thc4/working/articles_sample_500/CheckList_20_2___279-443.xml
8.0K /Users/thc4/working/articles_sample_500/CheckList_20_2___268-278.xml
8.0K /Users/thc4/working/articles_sample_500/CheckList_20_2___258-267.xml
8.0K /Users/thc4/working/articles_sample_500/CheckList_20_2___249-257.xml
8.0K /Users/thc4/working/articles_sample_500/CheckList_20_1___242-248.xml
8.0K /Users/thc4/working/articles_sample_500/CheckList_20_1___233-241.xml
8.0K /Users/thc4/working/articles_sample_500/CheckList_20_1___227-232.xml
4.0K /Users/thc4/working/articles_sample_500/EJT.2024.923.1-119.pdf.imf.xml
The sample was drawn from the latest 500 processed articles.
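
The small, likely metadata-only files listed above could be flagged automatically before validation. The sketch below is one way to do that in Python, assuming a size cutoff is a good enough proxy for "metadata only". The 8 KiB default and the function name `small_xml_files` are assumptions, and the sizes in the listing look like rounded disk usage, so actual byte counts may be lower.

```python
from pathlib import Path


def small_xml_files(directory, max_bytes=8 * 1024):
    """Return (size_in_bytes, path) pairs for .xml files at or below max_bytes.

    The cutoff is an assumed heuristic, not a rule from the issue.
    Results are sorted smallest first.
    """
    hits = []
    for path in Path(directory).glob("*.xml"):
        size = path.stat().st_size
        if size <= max_bytes:
            hits.append((size, path))
    return sorted(hits, key=lambda pair: (pair[0], str(pair[1])))


if __name__ == "__main__":
    # Hypothetical directory name; replace with the real sample location.
    for size, path in small_xml_files("articles_sample_500"):
        print(f"{size / 1024:.1f}K {path}")
```

Files flagged this way would still need a manual look, since a short article and a metadata-only stub can have similar sizes.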