turbomam / biosample-xmldb-sqldb

Tools for loading NCBI Biosample into an XML database and then transforming that into a SQL database
MIT License

"/srv/basex/shared-chunks/biosample_set_from_37000001.xml" (Line 20346097): XML document structures must start and end within the same entity. #21

Closed by turbomam 6 months ago

turbomam commented 6 months ago
head shared-chunks/biosample_set_from_37000001.xml
<?xml version="1.0" encoding="UTF-8"?>
<BioSampleSet>
<BioSample access="public" publication_date="2023-12-27T00:00:00.000" last_update="2023-12-27T14:33:14.303" submission_date="2023-12-27T14:10:10.597" id="39145858" accession="SAMN39145858">
  <Ids>
    <Id db="BioSample" is_primary="1">SAMN39145858</Id>
    <Id db_label="Sample name">CO-CDPHE-41544812</Id>
    <Id db="SRA">SRS20005259</Id>
  </Ids>
  <Description>
    <Title>PCR tiled amplicon WGS of SARS-CoV-2</Title>
tail shared-chunks/biosample_set_from_37000001.xml
    <Attribute attribute_name="geo_loc_name" harmonized_name="geo_loc_name" display_name="geographic location">USA</Attribute>
    <Attribute attribute_name="lat_lon" harmonized_name="lat_lon" display_name="latitude and longitude">missing</Attribute>
    <Attribute attribute_name="host" harmonized_name="host" display_name="host">missing</Attribute>
    <Attribute attribute_name="host_disease" harmonized_name="host_disease" display_name="host disease">missing</Attribute>
  </Attributes>
  <Links>
    <Link type="entrez" target="bioproject" label="PRJNA230403">230403</Link>
  </Links>
  <Status status="live" when="2024-02-12T02:56:05.936"/>
</BioSample>
turbomam commented 6 months ago

The closing </BioSampleSet> tag was not added to the last chunk, so the file ends after the final </BioSample> and is not well-formed.

turbomam commented 6 months ago

Temporary workaround: append the missing closing tag by hand.

echo "</BioSampleSet>" >> shared-chunks/biosample_set_from_37000001.xml
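
If the workaround might be run more than once, the append above can be guarded so the tag is not added twice. This is only a minimal sketch: the function name `ensure_closing_tag` is hypothetical and not part of this repository, and it assumes the closing tag sits alone on the last non-empty line of the chunk.

```shell
# Hypothetical helper (not part of the repository): append the closing
# root tag only when the chunk does not already end with it.
# Usage: ensure_closing_tag shared-chunks/biosample_set_from_37000001.xml
ensure_closing_tag() {
  file="$1"
  closing='</BioSampleSet>'

  # Inspect only the end of the file; the chunks are millions of lines long.
  last_line=$(tail -n 5 "$file" | grep -v '^[[:space:]]*$' | tail -n 1)

  if [ "$last_line" != "$closing" ]; then
    # If the file does not end with a newline, add one first so the tag
    # lands on its own line.
    if [ -n "$(tail -c 1 "$file")" ]; then
      printf '\n' >> "$file"
    fi
    printf '%s\n' "$closing" >> "$file"
  fi
}
```

Because the check runs before the append, calling the function on an already-repaired chunk leaves it unchanged.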
turbomam commented 6 months ago

Then move the already-loaded XML files into a different folder and re-run:

make load-biosample-sets all-ncbi-attributes-long-file non-attribute-metadata-file postgres-all
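
The move step before the re-run could be sketched roughly as follows. Everything here is an assumption for illustration: the function name `move_loaded_chunks`, the destination folder, and the idea that every chunk other than the repaired one has already been loaded. Check which chunks actually loaded before moving anything.

```shell
# Hypothetical helper (not part of the repository): move every .xml chunk
# except the repaired one out of the load directory.
# Usage: move_loaded_chunks SRC_DIR DEST_DIR KEEP_BASENAME
move_loaded_chunks() {
  src="$1"
  dest="$2"
  keep="$3"

  mkdir -p "$dest"
  for f in "$src"/*.xml; do
    # When nothing matches, the glob stays literal; skip it.
    [ -e "$f" ] || continue
    # Leave the repaired chunk in place so the re-run picks it up.
    if [ "$(basename "$f")" = "$keep" ]; then
      continue
    fi
    mv -- "$f" "$dest"/
  done
}
```

For example, `move_loaded_chunks shared-chunks loaded-chunks biosample_set_from_37000001.xml` would leave only the repaired chunk in `shared-chunks`; `loaded-chunks` is a made-up folder name.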