In the FedGaz data, there are many occurrences where one page is duplicated.
FedGaz has a small number of content-items, each spanning several pages. It appears that when one content-item ends and the next begins on the same page, that page got duplicated in the data.
The fix for this may be very simple (i.e. removing the duplicates from the `pp` lists in the content-items' metadata), but it might be worth investigating the cause further.
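As a rough illustration of the simple fix, here is a minimal sketch that removes repeated page numbers from a `pp` list while keeping the original order. The item layout (a dict with the metadata under an `"m"` key) and the function names `dedupe_pages` and `fix_content_item` are assumptions made for the example, not taken from the actual FedGaz data or codebase.

```python
from typing import Any


def dedupe_pages(pp: list[int]) -> list[int]:
    """Return pp with repeated page numbers removed, keeping first-seen order."""
    # dict preserves insertion order, so this drops later repeats only.
    return list(dict.fromkeys(pp))


def fix_content_item(item: dict[str, Any], meta_key: str = "m") -> bool:
    """Deduplicate item[meta_key]["pp"] in place; return True if it changed.

    The "m" metadata key is an assumed layout, adjust it to the real schema.
    """
    meta = item.get(meta_key, {})
    pages = meta.get("pp", [])
    cleaned = dedupe_pages(pages)
    if cleaned == pages:
        return False
    meta["pp"] = cleaned
    return True


# Hypothetical content-item in which page 4 was recorded twice.
example_item = {"id": "hypothetical-item-id", "m": {"pp": [3, 4, 4, 5]}}
changed = fix_content_item(example_item)
print(changed, example_item["m"]["pp"])
```

Returning whether the list changed makes it easy to count affected content-items, which could help with the further investigation mentioned above.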
Examples of issues affected by this duplication; more are available in this notebook.