telefonicaid / fiware-cygnus

A connector in charge of persisting context data sources into other third-party databases and storage systems, creating a historical view of the context
https://fiware-cygnus.rtfd.io/
GNU Affero General Public License v3.0

Some design ideas for the new NGSI connector #1

Closed fgalan closed 10 years ago

fgalan commented 10 years ago

Based on our previous experience with the ngsi2cosmos prototype, this issue describes some design ideas to guide the evolution of this component.

Although the interface towards Orion Context Broker doesn't change (i.e. it will still be based on notifyContextRequest, produced by a subscription at Orion), the new version of the component will allow different persistence backends that can be used simultaneously (i.e. the process could be configured to persist each notified context element to both a Cosmos backend and a CKAN backend; side note: using several backends at the same time raises the issue of transactionality, so I'd suggest keeping the first version simple and best-effort, without taking transactionality into account). In fact, we should think of a modular and extensible approach, so new backends can be added in the future. At the present moment the list would be Cosmos, MongoDB and CKAN.
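
The best-effort, non-transactional fan-out described above could look roughly like the following minimal Python sketch; all class and function names here are hypothetical illustrations, not the component's actual API:

```python
# Best-effort fan-out of a notified context element to several backends.
# All names below are hypothetical, for illustration only.

class Backend(object):
    def persist(self, context_element):
        raise NotImplementedError

class CosmosBackend(Backend):
    def persist(self, context_element):
        print("persisting to Cosmos: %s" % context_element["id"])

class CkanBackend(Backend):
    def persist(self, context_element):
        print("persisting to CKAN: %s" % context_element["id"])

def notify(context_element, backends):
    """Persist the element in every enabled backend. A failure in one
    backend does not roll back the others (no transactionality): errors
    are simply collected and returned."""
    errors = []
    for backend in backends:
        try:
            backend.persist(context_element)
        except Exception as exc:
            errors.append((backend, exc))
    return errors
```

The key design point is the absence of rollback: each backend either succeeds or fails independently, matching the "best-effort and simple" suggestion above.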

Thus, the name "ngsi2cosmos" should be changed to something more general, e.g. "ngsi_connector" (short but not precise, as connector may refer to input or output) or "ngsi_output_connector" (more precise, but long). Any idea? :)

Next, the command line should be improved (ngsi2cosmos relies on the arguments being given in one exact order, mixing mandatory ones with optional ones... that is a mess!). The command line could be something very simple, with just 3 parameters:

Optionally, we could consider the same approach used by nova_event_listener: parameters specified in the conf file but with the possibility of being overridden by a parameter of the same name on the command line (this makes sense for the dictionary-based format, but I'm not sure about the JSON-based format; see the discussion below).

Regarding the configuration file, it could be structured as a common section, with information that applies to all possible backends, plus per-backend sections (e.g. a section for Cosmos, a section for MongoDB). Enabling a particular backend would consist of including the corresponding section in the configuration file (in which case the information in that section would be used).
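
As an illustration, a dictionary-style file following that structure might look as follows (all section and option names are hypothetical, just to show the "enabled by presence" idea):

```ini
# Hypothetical layout: a [common] section plus one section per enabled
# backend. A backend is enabled simply by its section being present.
[common]
port = 5050
log_level = INFO

[cosmos]
httpfs_url = http://cosmos.example.com:14000
hdfs_path = /user/orion/data

[mongodb]
uri = mongodb://localhost:27017
db = orion_history
```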

What format should the configuration file use? I think two approaches are possible: dictionary-based (as the one used by nova_event_listener) or JSON-based. The advantage of the latter over the former is the possibility of easily including structured information, which some configuration pieces may need (e.g. solving the functionality at https://github.com/telefonicaid/fiware-livedemoapp/issues/4). On the other hand, the advantage of the dictionary-based format is a clearer syntax (maybe the trend will change in the future, but I think nowadays more people are used to dictionary-based configuration files than to JSON-based ones).

fgalan commented 10 years ago

Reference: https://github.com/pratid/fiware-monitoring/tree/master/nova_event_listener

pratid commented 10 years ago

My two cents:

1) We should try to follow the recommended directory structure for Python. This affects the preferred location of configuration files, the "conf/" directory (in which case the -c option could be unnecessary).

2) There are two reasons for using dictionary-based configuration files. On the one hand, to take advantage of the ConfigParser class. On the other hand, to take advantage of the logging.config module in order to set up logging handlers, formatters, etc. in a single, user-modifiable configuration file.
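
To illustrate the first point, a minimal sketch of parsing such a dictionary-style file with the standard ConfigParser class (the module is named configparser in Python 3); the sample sections and options are hypothetical:

```python
# Reading a dictionary-style (INI) configuration with the standard library.
# Section and option names are illustrative, not the component's actual ones.
from configparser import ConfigParser

SAMPLE = """
[common]
port = 5050

[cosmos]
hdfs_path = /user/orion/data
"""

parser = ConfigParser()
parser.read_string(SAMPLE)  # read() would take a file path instead

# A backend is considered enabled when its section is present in the file.
enabled_backends = [s for s in parser.sections() if s != "common"]
port = parser.getint("common", "port")
```

Note how the "enabling by section presence" convention falls out naturally from ConfigParser's sections() call, with no extra bookkeeping.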

3) About sections within configuration file... good approach.

frbattid commented 10 years ago

1) Regarding the name: in Big Data world, this kind of software is commonly called "data injector", thus I propose ngsiInjector.

2) Fermin said: i.e. the process could be configured to persist each notified context element to both a Cosmos backend and a CKAN backend.

If I've understood correctly, the same process will be persisting context elements in two or more backends, right? When I was thinking about it, I imagined a piece of software that could be installed very "close" (conceptually or physically) to the backend, e.g. in the same machine where Orion or Cosmos' master node runs, or in a machine belonging to the backend ecosystem. In other words, I was thinking of a set of injectors running independently, which avoids having a single point of failure, scales better, etc.

Perhaps the modularity that Fermín talks about can be achieved by defining those specialized injectors, which can be deployed individually, and allowing them to be federated (if required) thanks to a super-injector acting as a proxy/hub.

NOTE: If the injectors are split, maybe they should be called ngsiCosmosInjector, ngsiMongoInjector, ngsiCKANInjector and ngsiSuperInjector :-)

3) I think the subscription part of the process should be automated, the same way the current ngsi2cosmos creates (if not existing) both the HDFS folder and the Hive table each time it is run. The "if not existing" remark is important... is it possible to know whether a piece of software has already subscribed to certain context information? This would avoid duplicated subscriptions.
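
If the broker offers no way to list existing subscriptions, one best-effort approximation is to record the subscription id locally and skip re-subscribing while that record exists. A sketch of the idea in Python; the file name, payload structure and function names are hypothetical:

```python
# Approximate "subscribe only if not existing" by recording the subscription
# id returned by Orion in a local file. Names here are illustrative only.
import json
import os

SUB_FILE = "injector_subscription.json"

def ensure_subscription(subscribe_fn):
    """subscribe_fn performs the actual subscribeContext request and returns
    the subscription id assigned by Orion. It is only invoked when no local
    record of a previous subscription exists."""
    if os.path.exists(SUB_FILE):
        with open(SUB_FILE) as f:
            return json.load(f)["subscription_id"]
    sub_id = subscribe_fn()
    with open(SUB_FILE, "w") as f:
        json.dump({"subscription_id": sub_id}, f)
    return sub_id
```

This only prevents duplicates created by the same injector instance on the same host; it cannot detect subscriptions made by other processes, which is exactly the gap an administrative API in Orion would fill.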

fgalan commented 10 years ago

@pratid

  1. We will use the reference directory structure you mention. However, this doesn't preclude the use of the -c option (e.g. consider running several ngsi_injector processes in the same VM, each one using a different configuration file).
  2. Good points in favour of the dictionary style. Next, I will try to draft the config file using the dictionary style, to check whether it satisfies all the expressibility requirements we have on the table right now.
fgalan commented 10 years ago

@frbattid

  1. ngsiInjector sounds good :) However, I'm not sure the spelling is correct considering Python style guides. I guess it should be ngsi_injector, although I'm not 100% sure right now.
  2. Note that with a multi-backend injector you can always run it configured to use just one backend, so the case you propose, of having the injector as close to the destination as possible, is still covered. However, if the injector is designed in a mono-backend way, the only way of achieving multi-backend is to define several processes, each one listening on a different port and involving a different subscription at Orion. In other words, the multi-backend approach is flexible enough to allow both cases. Regarding the hub/proxy idea, it would involve adding a new "protocol" between the hub and the "local injectors". I don't think that is a good idea, as it adds another protocol layer (e.g. from NGSI->HttpFS to NGSI->new_protocol->HttpFS, a bit weird).
  3. NGSI doesn't allow querying for existing subscriptions, so this is hard to implement right now. However, this is not the first time this need has been discussed; it could be an interesting feature to expose this information through some non-NGSI administrative API (an issue in fiware-orion has been created to research this: https://github.com/telefonicaid/fiware-orion/issues/232). However, whether or not that mechanism exists, ngsi_injector could have a true/false flag to make subscriptions at boot time (the user would choose whether to use it, based on his/her knowledge of the subscriptions in Orion). Moreover, as a helper tool related to ngsi_injector it would be interesting to have a command line tool to ease subscriptions, e.g. ngsi_injector_subscribe.py "NODE.* TimeInstant". Edit: actually, we have an existing issue related to this at https://github.com/telefonicaid/fiware-livedemoapp/issues/11
frbattid commented 10 years ago

From a conversation with @sergg75:

Playing with Apache Flume a few days ago I realized how easy it was to gather and persist Twitter data in Cosmos. Basically, Apache Flume is a data ingestion system configured by defining the endpoints of a data flow, called sources and sinks. Some standard sources and sinks are available in a library; among them we can find an HDFS sink (http://flume.apache.org/FlumeUserGuide.html#hdfs-sink), which is used in the Twitter data gathering by configuring a couple of properties in a config file:

```
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:9000/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
```

The source endpoint in the Twitter use case is also configured in the config file; nevertheless, it is not a standard source but a custom one (https://github.com/cloudera/cdh-twitter-example/tree/master/flume-sources).

That said, Sergio and I were wondering whether it makes sense to develop our injectors/connectors as Flume components. Advantages are:

Starting with ngsi2cosmos, the notification part of the code can be packaged as a new NGSI-based source, while the persistence part could rely on the standard HDFS sink (if not, a new sink can be defined).
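
For illustration, the agent wiring for such an NGSI source plus the standard HDFS sink might look like this; the source class name org.example.NGSISource is a placeholder, not an existing class, and the paths are hypothetical:

```
# Hypothetical Flume agent: a custom NGSI source feeding the standard
# HDFS sink through a memory channel.
ngsi-agent.sources = http-ngsi
ngsi-agent.sinks = hdfs-sink
ngsi-agent.channels = mem-channel

ngsi-agent.sources.http-ngsi.type = org.example.NGSISource
ngsi-agent.sources.http-ngsi.channels = mem-channel

ngsi-agent.sinks.hdfs-sink.type = hdfs
ngsi-agent.sinks.hdfs-sink.channel = mem-channel
ngsi-agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/user/orion/ngsi/
ngsi-agent.sinks.hdfs-sink.hdfs.fileType = DataStream
ngsi-agent.sinks.hdfs-sink.hdfs.writeFormat = Text

ngsi-agent.channels.mem-channel.type = memory
ngsi-agent.channels.mem-channel.capacity = 1000
```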

fgalan commented 10 years ago

Using Flume sounds good. I think it could ease the development a lot and (more importantly) further maintenance. However, we should also consider how easy/difficult it is to set up the NGSI-to-Cosmos injection compared with the existing ngsi2cosmos. I mean, currently it is just "start a process from the CLI and that's all". Is it as simple with the Flume approach? How complex is installing the Flume framework needed to run the injection?

Having said that, I think it makes sense to do a PoC of the NGSI-to-Cosmos injection using Flume components, including:

  1. Developing the needed code.
  2. Installing it in the FI-LAB testbed (replacing the current ngsi2cosmos processes in orion.lab) to check that the solution is functionally equivalent.
  3. Documenting the installation and usage process from a developer's perspective. With that piece of documentation we can evaluate how simple/difficult it is compared with the current ngsi2cosmos.
frbattid commented 10 years ago

It is quite easy: installing Flume is just a matter of downloading a tar.gz and moving the untarred folder to the desired location. Then a configuration file must be created. If non-native Flume sources or sinks are needed (as some of our injectors may require), they must be downloaded and copied to apache-flume/lib. Then a command is executed.

+1 to having a PoC based on ngsi2cosmos.

fgalan commented 10 years ago

We have finished the PoC step (merged into the develop branch in PR #8): we have replaced the ngsi2cosmos.py processes in orion.lab and, after a day of full operation, it seems to work fine and be functionally equivalent. Success!

Thus, we should consider the next steps. I think there are two lines along which to continue the work: one about consolidating the PoC prototype into a full-fledged FI-WARE component, the other about "exploring" new aspects.

Maybe the different items could be "fragmented" into different issues, so they can be addressed at different paces.

fgalan commented 10 years ago

The discussion has matured in Cygnus, so this issue has achieved its purpose and is no longer needed. Regarding the point mentioned in the last comment, a specific issue has been created for it: