noirello / pyorc

Python module for Apache ORC file format
Apache License 2.0

Rec skips during sequential reads #47

Closed robrice closed 2 years ago

robrice commented 2 years ago

Not sure if this is a result of deleted records in my ORC file, but using the AWS planet.orc (OpenStreetMap) file, during sequential reads of each row, the first column (row[0]) does not increment by 1 in all cases. Here is sample output from my script: the "Rec:" value is from the ORC data[0] and the "Recs:" value is simply an incremented counter.

Opening ORC Source: s3://osm-pds/planet/planet-latest.orc
Reading ORC data
Schema: struct<id:bigint,type:string,tags:map<string,string>,lat:decimal(9,7),lon:decimal(10,7),nds:array<struct>,members:array<struct<type:string,ref:bigint,role:string>>,changeset:bigint,timestamp:timestamp,uid:bigint,user:string,version:bigint,visible:boolean>
Stripes: 3085
Rows: 8222531885
Lengths: {'content_length': 88272336575, 'file_footer_length': 47320, 'file_postscript_length': 27, 'file_length': 88273129466, 'stripe_statistics_length': 745543}
Stripes 3085
Stripe Len 3085
Test SB True: True
Test SB True: True
Test SB False: False
Rec:1 Recs:1 Time:2021-07-10 00:53:52+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:0
Rec:2 Recs:2 Time:2021-05-18 09:03:40+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:1
Rec:3 Recs:3 Time:2021-10-19 17:46:10+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:2
Rec:10 Recs:4 Time:2020-04-13 13:27:47+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:3
Rec:54 Recs:5 Time:2021-09-20 09:43:13+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:4
Rec:100 Recs:6 Time:2021-01-06 19:54:29+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:5
Rec:110 Recs:7 Time:2018-07-21 22:01:43+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:6
Rec:111 Recs:8 Time:2020-03-29 17:27:08+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:7
Rec:112 Recs:9 Time:2016-09-18 15:36:55+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:8
Rec:113 Recs:10 Time:2018-07-21 22:01:43+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:9
Rec:114 Recs:11 Time:2021-01-05 21:02:32+00:00 Nwrite:0 Wwrite:0 Typeo:0 NegWay:0, Outside:0 Ntime:10

noirello commented 2 years ago

It doesn't look like it's skipping any rows. That's truly the file's content: the id column in this file is simply not contiguous. I got the same result using Java orc-tools:

$ orc-tools data planet-latest.orc
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file planet-latest.orc [length: 88431268984]
{"id":1,"type":"node","tags":[{"_key":"communication:microwave","_value":"yes"},{"_key":"communication:radio","_value":"fm"},{"_key":"man_made","_value":"mast"},{"_key":"name","_value":"Monte Piselli - San Giacomo"},{"_key":"description","_value":"Radio Subasio"},{"_key":"tower:construction","_value":"lattice"},{"_key":"frequency","_value":"105.5 MHz"},{"_key":"tower:type","_value":"communication"}],"lat":"42.7957187","lon":"13.5690032","nds":[],"members":[],"changeset":115176312,"timestamp":"2021-12-20 17:54:50.0","uid":9475258,"user":"Aranc","version":27,"visible":true}
{"id":2,"type":"node","tags":[{"_key":"image","_value":"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/d\/d3\/Pulkovo_meridian_zero-point.jpg"},{"_key":"historic","_value":"memorial"},{"_key":"man_made","_value":"survey_point"},{"_key":"name","_value":"Центр Круглого Зала (ЦКЗ)"},{"_key":"wikipedia","_value":"ru:Пулковский меридиан"},{"_key":"wikidata","_value":"Q4383612"}],"lat":"59.7717926","lon":"30.32611","nds":[],"members":[],"changeset":104876400,"timestamp":"2021-05-18 09:03:40.0","uid":571410,"user":"Антін Сартенченко","version":31,"visible":true}
{"id":3,"type":"node","tags":[{"_key":"amenity","_value":"kindergarten"},{"_key":"name","_value":"El Duende Rosado"},{"_key":"preschool","_value":"yes"}],"lat":"-34.8140751","lon":"-58.4899743","nds":[],"members":[],"changeset":112710145,"timestamp":"2021-10-19 17:46:10.0","uid":9946929,"user":"lsesntion","version":11,"visible":true}
{"id":10,"type":"node","tags":[{"_key":"alt_name","_value":"Mamacita"},{"_key":"name","_value":"Mamassita"},{"_key":"place","_value":"village"},{"_key":"source","_value":"Info of imported schools by http:\/\/wiki.openstreetmap.org\/wiki\/Import_MALI_UNICEF_Education;Bing"}],"lat":"14.2769353","lon":"-11.0519163","nds":[],"members":[],"changeset":83479649,"timestamp":"2020-04-13 13:27:47.0","uid":5568154,"user":"edvac_reverts","version":12,"visible":true}
{"id":54,"type":"node","tags":[{"_key":"name:oc","_value":"Novosibirsk"},{"_key":"capital","_value":"3"},{"_key":"source:name:oc","_value":"Lo Congrès"},{"_key":"source:population","_value":"https:\/\/novosibstat.gks.ru\/storage\/mediabank\/p54_PRESS79_2020.pdf"},{"_key":"name:sk","_value":"Novosibirsk"},{"_key":"name:uk","_value":"Новосибірськ"},{"_key":"official_status","_value":"ru:город"},{"_key":"name:sl","_value":"Novosibirsk"},{"_key":"name:ca","_value":"Novossibirsk"},{"_key":"name:kn","_value":"ನೋವೋಸಿಬಿರ್ಸ್ಕ್"},{"_key":"name:ko","_value":"노보시비르스크"},{"_key":"old_name:fr","_value":"Novonikolaïevsk"},{"_key":"admin_level","_value":"3"},{"_key":"name:en","_value":"Novosibirsk"},{"_key":"place","_value":"city"},{"_key":"wikipedia","_value":"ru:Новосибирск"},{"_key":"name:et","_value":"Novosibirsk"},{"_key":"name:cs","_value":"Novosibirsk"},{"_key":"name:es","_value":"Novosibirsk"},{"_key":"name:zh","_value":"新西伯利亚"},{"_key":"start_date","_value":"1893"},{"_key":"name:ar","_value":"نوفوسيبيرسك"},{"_key":"old_name:fr:1893-1925","_value":"Novonikolaïevsk"},{"_key":"name:ja","_value":"ノヴォシビルスク"},{"_key":"old_name:ru","_value":"Ново-Николаевск"},{"_key":"name:pl","_value":"Nowosybirsk"},{"_key":"name:da","_value":"Novosibirsk"},{"_key":"name:he","_value":"נובוסיבירסק"},{"_key":"name:ro","_value":"Novosibirsk"},{"_key":"population","_value":"1620162"},{"_key":"name:be","_value":"Новасібірск"},{"_key":"name:fi","_value":"Novosibirsk"},{"_key":"name:ru","_value":"Новосибирск"},{"_key":"name:pt","_value":"Novosibirsk"},{"_key":"name:de","_value":"Nowosibirsk"},{"_key":"name:hi","_value":"नोवोसिबिर्स्क"},{"_key":"old_name:en","_value":"Novo-Nikolayevsk"},{"_key":"name:lt","_value":"Novosibirskas"},{"_key":"old_name","_value":"Ново-Николаевск"},{"_key":"name","_value":"Новосибирск"},{"_key":"name:fr","_value":"Novossibirsk"},{"_key":"population:date","_value":"2021-01-01"},{"_key":"name:hr","_value":"Novosibirsk"},{"_key":"name:lv","_value":"Novosibirska"},{"_key":"wikidata","_value":"Q883"}],"lat":"55.0282171","lon":"82.9234509","nds":[],"members":[],"changeset":111440805,"timestamp":"2021-09-20 09:43:13.0","uid":3786091,"user":"nicolarus","version":24,"visible":true}
{"id":100,"type":"node","tags":[{"_key":"historic","_value":"memorial"},{"_key":"description","_value":"Weltkriegsdenkmal"},{"_key":"memorial","_value":"war_memorial"}],"lat":"52.8916184","lon":"10.8340913","nds":[],"members":[],"changeset":97068748,"timestamp":"2021-01-06 19:54:29.0","uid":3277230,"user":"Polarbear-repair","version":13,"visible":true}
{"id":110,"type":"node","tags":[],"lat":"59.9499101","lon":"10.783415","nds":[],"members":[],"changeset":60940511,"timestamp":"2018-07-21 22:01:43.0","uid":207581,"user":"Hjart","version":6,"visible":true}
{"id":111,"type":"node","tags":[],"lat":"59.9475022","lon":"10.7875238","nds":[],"members":[],"changeset":82786301,"timestamp":"2020-03-29 17:27:08.0","uid":538387,"user":"forteller","version":7,"visible":true}
{"id":112,"type":"node","tags":[{"_key":"barrier","_value":"block"},{"_key":"bicycle","_value":"yes"}],"lat":"59.9515081","lon":"10.7854259","nds":[],"members":[],"changeset":42249840,"timestamp":"2016-09-18 15:36:55.0","uid":3497659,"user":"NKA","version":8,"visible":true}
{"id":113,"type":"node","tags":[],"lat":"59.9487442","lon":"10.7819317","nds":[],"members":[],"changeset":60940511,"timestamp":"2018-07-21 22:01:43.0","uid":207581,"user":"Hjart","version":5,"visible":true}
{"id":114,"type":"node","tags":[],"lat":"59.9506757","lon":"10.784339","nds":[],"members":[],"changeset":97005326,"timestamp":"2021-01-05 21:02:32.0","uid":3277230,"user":"Polarbear-repair","version":6,"visible":true}
{"id":115,"type":"node","tags":[],"lat":"59.9510531","lon":"10.7796921","nds":[],"members":[],"changeset":97005326,"timestamp":"2021-01-05 21:02:32.0","uid":3277230,"user":"Polarbear-repair","version":5,"visible":true}
{"id":116,"type":"node","tags":[],"lat":"59.9525466","lon":"10.7781722","nds":[],"members":[],"changeset":97005326,"timestamp":"2021-01-05 21:02:32.0","uid":3277230,"user":"Polarbear-repair","version":5,"visible":true}
{"id":117,"type":"node","tags":[],"lat":"59.9515953","lon":"10.7813416","nds":[],"members":[],"changeset":97005326,"timestamp":"2021-01-05 21:02:32.0","uid":3277230,"user":"Polarbear-repair","version":4,"visible":true}

robrice commented 2 years ago

That is interesting - thanks very much for verifying that!
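
For anyone who wants to repeat the check from this thread, here is a minimal sketch of one way to tell a gap in the id column apart from rows skipped by the reader. It is not the script from the original report. The helper name `find_id_gaps` is hypothetical, and the sample ids are the ones from the output above. The commented-out pyorc usage assumes a local copy of the file and that `pyorc.Reader` accepts `column_names` and supports `len()`, so treat that part as an assumption to verify against the pyorc docs.

```python
from typing import Iterable, List, Sequence, Tuple


def find_id_gaps(rows: Iterable[Sequence[int]]) -> Tuple[int, List[Tuple[int, int, int]]]:
    """Count rows and record every place where row[0] does not increase by exactly 1.

    Returns (row_count, gaps), where each gap is (row_number, previous_id, current_id).
    """
    gaps = []
    prev_id = None
    row_count = 0
    for row_count, row in enumerate(rows, start=1):
        current_id = row[0]
        if prev_id is not None and current_id - prev_id != 1:
            gaps.append((row_count, prev_id, current_id))
        prev_id = current_id
    return row_count, gaps


# The first eleven ids from the output in this thread, as one-column rows.
sample_rows = [(i,) for i in (1, 2, 3, 10, 54, 100, 110, 111, 112, 113, 114)]
row_count, gaps = find_id_gaps(sample_rows)
print(row_count)  # number of rows actually read
print(gaps)       # places where the id column itself jumps

# Sketch of the same check against a real file (assumption: pyorc is installed
# and planet-latest.orc has been downloaded locally):
#
#   import pyorc
#   with open("planet-latest.orc", "rb") as f:
#       reader = pyorc.Reader(f, column_names=["id"])
#       row_count, gaps = find_id_gaps(reader)
#       # If row_count equals len(reader), no rows were skipped;
#       # any gaps are in the data itself.
```

If the number of rows read matches the row count reported in the file metadata, the reader has returned every row, and any jumps in `row[0]` come from the data rather than from the library.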