frbattid closed this issue 9 years ago
As discussed offline with @elenatid, another file format could be CSV, in both row and column modes.
Regarding CSV persistence, a problem arises when dealing with the attribute's metadata.
Metadata cannot be easily serialized as CSV because its value may be an array of objects. Thus, the CSV records may end up with different lengths, which is a problem for Hive tables.
The following solution was discussed with @elenatid: link the metadata field to a new HDFS file containing one record per object within the array. The link is simply the path of that file, for instance:
# content of /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt:
23-05-2015T06:37:41,1230923875,room1,room,temperature,float,23.5,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
# content of /user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
1230923875,timestamp_meassure,long,1230923874
1230923875,timestamp_sent,long,1230923875
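The write side of this scheme can be sketched as follows. This is only an illustration of the idea, not the actual Cygnus implementation: the helper name `persist_row` and the use of a local temporary directory standing in for HDFS are assumptions.

```python
import csv
import os
import tempfile

def persist_row(data_file, md_file, recv_time, recv_time_ts,
                entity_id, entity_type, attr_name, attr_type,
                attr_value, metadata):
    """Append one CSV record to the data file, whose last field links to
    the metadata file, and one record per metadata object to that file."""
    os.makedirs(os.path.dirname(data_file), exist_ok=True)
    os.makedirs(os.path.dirname(md_file), exist_ok=True)
    with open(data_file, "a", newline="") as f:
        csv.writer(f).writerow([recv_time, recv_time_ts, entity_id,
                                entity_type, attr_name, attr_type,
                                attr_value, md_file])
    with open(md_file, "a", newline="") as f:
        writer = csv.writer(f)
        for md in metadata:
            # recv_time_ts is the key linking both files
            writer.writerow([recv_time_ts, md["name"], md["type"], md["value"]])

# demo: a temporary directory stands in for the HDFS user space
base = tempfile.mkdtemp()
data_file = os.path.join(base, "room1_room", "room1_room.txt")
md_file = os.path.join(base, "room1_room_temperature_md",
                       "room1_room_temperature_md.txt")
persist_row(data_file, md_file, "23-05-2015T06:37:41", 1230923875,
            "room1", "room", "temperature", "float", 23.5,
            [{"name": "timestamp_sent", "type": "long", "value": 1230923875}])
```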
The recvTimeTs value within /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt is used as a key for finding the metadata associated with each record. For instance, if a second update is notified then we will have:
# content of /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt:
23-05-2015T06:37:41,1230923875,room1,room,temperature,float,23.9,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
23-05-2015T06:37:51,1230933870,room1,room,temperature,float,24.1,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
# content of /user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
1230923875,timestamp_meassure,long,1230923874
1230923875,timestamp_sent,long,1230923875
1230933870,timestamp_meassure,long,1230933869
1230933870,timestamp_sent,long,1230933870
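The reading side of the lookup can be sketched like this (an illustration only, not Cygnus code): grouping the metadata file by its first column, recvTimeTs, lets each data record recover exactly its own metadata by key.

```python
import csv
from collections import defaultdict
from io import StringIO

def load_metadata(md_csv):
    """Group metadata records by recvTimeTs (the first CSV column)."""
    groups = defaultdict(list)
    for ts, name, type_, value in csv.reader(md_csv):
        groups[ts].append({"name": name, "type": type_, "value": value})
    return groups

# StringIO stands in for the metadata file on HDFS
md_file = StringIO(
    "1230923875,timestamp_sent,long,1230923875\n"
    "1230933870,timestamp_sent,long,1230933870\n"
)
groups = load_metadata(md_file)
# metadata for the second update, looked up by its recvTimeTs
print(groups["1230933870"])
```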
As said, this works even if the metadata length changes:
# content of /user/frb/def_serv/def_serv_path/room1_room/room1_room.txt:
23-05-2015T06:37:41,1230923875,room1,room,temperature,float,23.9,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
23-05-2015T06:37:51,1230933870,room1,room,temperature,float,24.1,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
23-05-2015T06:38:01,1230943871,room1,room,temperature,float,24.2,/user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
# content of /user/frb/def_serv/def_serv_path/room1_room_temperature_md/room1_room_temperature_md.txt
1230923875,timestamp_meassure,long,1230923874
1230923875,timestamp_sent,long,1230923875
1230933870,timestamp_meassure,long,1230933869
1230933870,timestamp_sent,long,1230933870
1230943871,timestamp_meassure,long,1230943870
1230943871,timestamp_sent,long,1230943871
1230943871,verified,boolean,true
For any other attribute, we would have /user/frb/def_serv/def_serv_path/room1_room_<attr_name>_md/room1_room_<attr_name>_md.txt.
The mechanism is valid both for row and column mode.
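For illustration, building that per-attribute metadata path could look like the following. The helper name and argument list are assumptions for the sketch, not the actual Cygnus API.

```python
def metadata_path(hdfs_user, service, service_path,
                  entity_id, entity_type, attr_name):
    """Build the HDFS path of the metadata file for a given attribute,
    following the <entityId>_<entityType>_<attrName>_md naming pattern."""
    dir_name = f"{entity_id}_{entity_type}_{attr_name}_md"
    return f"/user/{hdfs_user}/{service}/{service_path}/{dir_name}/{dir_name}.txt"

print(metadata_path("frb", "def_serv", "def_serv_path",
                    "room1", "room", "temperature"))
```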
Implemented in PR https://github.com/telefonicaid/fiware-cygnus/pull/492
Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. (http://parquet.incubator.apache.org/)
The current `cygnusagent.sinks.hdfs-sink.attr_persistence` parameter can be extended to work not only with `row` and `column` but with `parquet` as well. The `row` and `column` modes could even be renamed to `json-row` and `json-column`, respectively, and `cygnusagent.sinks.hdfs-sink.attr_persistence` could also be renamed to `cygnusagent.sinks.hdfs-sink.file_format`. These changes would allow for further file formats in the future.
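Under that proposal, the relevant agent configuration might look as follows. This is a hypothetical sketch of the renamed parameter, not a released option:

```
# hypothetical renamed parameter; the current equivalent is
# cygnusagent.sinks.hdfs-sink.attr_persistence = row | column
cygnusagent.sinks.hdfs-sink.file_format = json-row
# other accepted values would be json-column and parquet
```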