openbudgets / pipeline-fragments

Reusable fragments of LinkedPipes ETL pipelines

FDPtoRDF: scalability #11

Closed marek-dudas closed 7 years ago

marek-dudas commented 8 years ago

The pipeline can take an hour or so to process a 20MB .csv (on a Core i7, 8GB RAM). I know that it was developed rather quickly, with a focus on correct output rather than scalability, so some optimization could probably be achieved by changing its structure. However, maybe @jakubklimek or someone else has some general optimization tips for LinkedPipes? Changes will be necessary, as the pipeline has to be able to deal with hundreds of MB in one file.

jakubklimek commented 8 years ago

There are no generic tips :) but we can definitely spend some time optimizing the pipeline structure.

pwalsh commented 8 years ago

To be honest, if it takes one hour 'or so' to process a 20MB text file, 'scalability' is likely not the problem; rather, there is some serious issue in the design of the pipeline itself or in the underlying framework.

jindrichmynarz commented 8 years ago

We recently discussed performance optimizations in LP-ETL with @jakubklimek because of the pipeline for ESF CZ 2007-2013 projects. This dataset is quite large (1 GB in Turtle) and its pipeline is sufficiently complex. The execution of this dataset's pipeline took several days, often running out of memory or having problems with too many writes to disk.

What helped to improve the runtime significantly was merging instances of SPARQL Update components into a single component that uses multiple SPARQL Update operations separated by semicolons. See the SPARQL 1.1 Update specification:

Multiple operations are separated by a ';' (semicolon) character. A semicolon after the last operation in a request is optional. Implementations must ensure that the operations of a single request are executed in a fashion that guarantees the same effects as executing them sequentially in the order they appear in the request.

This way LP-ETL creates an in-memory RDF store only once, which saves it much effort. However, this optimization comes at the cost of making the pipeline more difficult to maintain. You need to maintain a potentially large SPARQL Update request, and you lose the intermediate debug data. Moreover, only successive SPARQL Update components with the same input edges can be merged.
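As an illustrative sketch (the graph patterns and the ex: vocabulary are made up; only the semicolon-separated structure matters), two updates that previously lived in separate components can be merged into one request:

```sparql
PREFIX ex:   <http://example.com/ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Operation 1: derive labels from a temporary property
INSERT { ?concept skos:prefLabel ?label . }
WHERE  { ?concept ex:rawLabel ?label . } ;

# Operation 2: clean up the temporary property; it runs against the same
# in-memory store, so no data is copied between components
DELETE WHERE { ?concept ex:rawLabel ?label . }
```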

This optimization could potentially be automated in a non-debug LP-ETL mode, which could also use other optimizations to reduce the pipeline's memory and disk-space footprint. However, non-debug mode is a large feature and it may take a long time before it is implemented.

I initially suggested "chunking" the input CSV files. Many pipelines with CSV input produce the same results whether we run them with a complete CSV file or with chunks of the CSV file whose output RDF is finally merged. In other words, transformations of CSV usually do not use any relationships between the CSV's rows. Chunking would significantly reduce memory consumption and allow better parallelization. However, LP-ETL does not support this, nor does it support processing multiple RDF data units in most components, so implementing this optimization would also be a lot of work.

Other optimizations LP-ETL may try include stream parsing (e.g., in Excel to CSV) or binding iterators as used in tarql to reduce memory consumption of CSV to RDF conversion.

jindrichmynarz commented 8 years ago

It may also be useful to check the sizes of the intermediate RDF data units produced by the pipeline. In the pipelines I developed, I sometimes found that SPARQL Updates or CONSTRUCTs materialized unnecessary cross products of bindings, thus performing needless computation and blowing up the size of the resulting data unit. You can check whether the size of the intermediate data grows linearly: if there is exponential growth, it must be justified by the transformation.
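As a contrived example of the kind of query to look for (made-up ex: vocabulary): the two triple patterns below share no variable, so the result is a cross product of all rows and all concepts rather than a join:

```sparql
PREFIX ex: <http://example.com/ns#>

# ?row and ?concept are not connected by any shared variable, so this
# materializes (number of rows) × (number of concepts) result triples.
CONSTRUCT { ?row ex:relatedTo ?concept . }
WHERE {
  ?row     ex:amount ?amount .
  ?concept ex:label  ?label .
}
```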

pwalsh commented 8 years ago

@jindrichmynarz that is useful background information.

Both the original issue (a 20MB file taking an hour or more to process) and the additional info from Jindrich (around 1GB of data taking several days to process) highlight a quite severe performance issue (not a scalability issue as such) in LinkedPipes, or in the particular pipelines in question.

This performance issue is quite critical, possibly blocking, for the application of pipelines in WP4 (a live, integrated system) ( cc @HimmelStein @mlukasch ).

For reference, OpenSpending is regularly processing files anywhere from 1MB up to 1GB (it can probably handle larger, as everything is implemented with streams in the browser and on the server, but AFAIK it has not been tested on larger than that yet).

jakubklimek commented 8 years ago

First of all, sure, large files processed by many components in a pipeline (which is sequential, not streamed) will be slow:

  1. Large files mean many RDF triples => large memory consumption by a triple store.
  2. Many components mean a lot of data copying, because so far, more or less, each component has its own copy of input and output data.

The combination of the two issues explains the resource intensiveness of the pipeline.

This was not an issue in any of the LP-ETL use cases so far; therefore, we have not focused on that aspect (optimization) yet. Instead, we focused on functionality and debugging support. This means that every pipeline essentially runs in "debug" mode, which gives you intermediate results of every operation; this is not necessary once the pipeline is fully developed and debugged.

There are multiple ways of optimization that we are aware of, some of them specific to this use case; however, the real question is the definition of performance requirements, which was not mentioned anywhere before. Processing in LP-ETL will never compare to plain text processing simply because it is not plain text processing. It works with triplestores which represent the individual triples as objects in memory, allow SPARQL querying etc. and those technologies are nowhere near the maturity of relational databases. What we can do is improve the way we work with triplestores to some degree, but we cannot improve the triplestores themselves and therefore there is a limit to the optimizations we can do in LP-ETL.

For example, 500M triples is usually the limit on the number of triples in a triplestore on regular HW (16GB RAM). The simplest representation of a table in RDF takes approx. (# of cells + # of rows) triples. That corresponds to ONE table with 50M rows and 10 columns (50M × 10 ≈ 500M cell triples, plus 50M row triples). Anything larger than that will always be a problem. On the other hand, that is simply not a good use case for RDF triplestores, nor does it represent a typical dataset.

So let's talk about the expected average size of the input files and the frequency with which they should be processed. We have already established that not every OS dataset will be processed by OBEU, and so far, in the manually created pipelines, we have not seen an OBEU dataset (in any OBEU use case) larger than we can handle now. I have not done an analysis of the datasets processed so far, but I suspect that a typical tabular budget dataset will be 100-10K rows?

Also, how often will such datasets arrive? I.e., what is the expected time to process such a dataset?

Depending on these requirements, we can decide whether optimization in LP-ETL is possible, or whether another approach needs to be taken. There are many possibilities, the most extreme one being a single-purpose, highly-optimized, maybe even pure text-based transformer, omitting LP-ETL for the automated transformations. But is it really necessary?

schmaluk commented 8 years ago

I don't know either FDP or the OBEU data model particularly well, so this might all be total nonsense. Would it help to split the FDP data (i.e. CSV files) on our side into smaller digestible parts, parallelize the pipeline processing by forwarding them to multiple LinkedPipes instances, and afterwards merge the results? (This might not work if those results depend on each other...)

If it takes more time to process, this would be the lesser evil compared to an out-of-memory error. If LP is using its own in-memory triplestores internally, would it help to use triplestores that store data on the filesystem?

pwalsh commented 8 years ago

@jakubklimek

Processing in LP-ETL will never compare to plain text processing simply because it is not plain text processing. It works with triplestores which represent the individual triples as objects in memory, allow SPARQL querying etc. and those technologies are nowhere near the maturity of relational databases.

Yes, I understand. Even so, would 1+ hours to make triples from a 20MB text file be considered acceptable performance for these technologies? If yes, then I have no issue with that, but clearly we have to communicate the fact that this puts significant restrictions on feasibility of an "integrated" live platform for budget data.

HimmelStein commented 8 years ago

This performance issue is quite critical, possibly blocking, for the application of pipelines in WP4 (a live, integrated system) ( cc @HimmelStein @mlukasch ).

@pwalsh yes, this will be a problem. Besides improving the pipelines, we can have more than one FDP2RDF pipeline on the platform. FDP files would be assigned to different FDP2RDF pipelines based on their sizes. Then, small FDP files would not be blocked.

jakubklimek commented 8 years ago

@pwalsh

Yes, I understand. Even so, would 1+ hours to make triples from a 20MB text file be considered acceptable performance for these technologies?

This depends simply on the number and complexity of operations. The input size is not the only variable. But as I explained before, there are optimizations that we can do, which may even significantly lower the time. But it would be helpful to know the expectations so that a reasonable direction of optimization can be taken.

this puts significant restrictions on feasibility of an "integrated" live platform for budget data.

Not necessarily. Some of the use cases are that a municipality or journalists deploy their own instance and play with their own data. For that use case, this is not such a big issue. It is an issue for an instance processing large numbers of large datasets.

pwalsh commented 8 years ago

We've already agreed that WP4 is a centralized platform at its core, so, as far as data processing goes, we do indeed have a single instance processing large numbers of (potentially) large datasets. A muni or other org deploying their own instance would deploy front end apps, not data processing pipelines.

jindrichmynarz commented 8 years ago

A muni or other org deploying their own instance would deploy front end apps, not data processing pipelines.

If I understand this correctly: a municipality that wants to take advantage of the OpenBudgets.eu platform must start by uploading their data in FDP to the platform, and only then can the municipality deploy the front end apps that source the municipality's data from the platform. This might work if the OpenSpending Packager suffices and no custom pipelines are needed. If custom pipelines are needed, then we either need an open instance of LP-ETL (probably not a good idea) or we need to allow deploying the platform not only with the front end apps but also with LP-ETL.

pwalsh commented 8 years ago

@jindrichmynarz can you envisage a scenario where an administrator from a municipality, who are generally non-technical staff using Excel, will run custom pipelines in LP-ETL?

I've done a lot of work with government on data publication, and I can't possibly imagine such a scenario. It can be hard enough to produce a valid CSV file.

In any event, yes, as Soren and Fraunhofer have made clear numerous times, the public facing UI for adding data, outside of the custom pipelines from WP2, will be by using the OpenSpending Packager.

jindrichmynarz commented 8 years ago

Let's look at this from another perspective. We've seen that not all budget data comes in CSVs. For example, there are public sector institutions that have it in XML or HTML, or expose it via an internal API of their accounting software. If the only entry point to the OpenBudgets.eu platform is CSV (or similar tabular data, such as Excel spreadsheets) via the OpenSpending Packager, such data needs to be transformed to CSV. The transformation can often be done using LP-ETL.

Now, the question is if we want the OpenBudgets.eu platform to also cover this need. If so, we can "bundle" LP-ETL in. Otherwise, we can leave this task to other tools (or a separate LP-ETL instance). I think both options are perfectly fine.

can you envisage a scenario where an administrator from a municipality, who are generally non-technical staff using Excel, will run custom pipelines in LP-ETL?

While transforming non-CSV data to CSV requires technical prowess, it can be done by an external contractor or a technologist directly employed within the government (I've seen both).

As Soren and Fraunhofer have made clear numerous times, the public facing UI for adding data, outside of the custom pipelines from WP2, will be by using the OpenSpending Packager.

I'm not disputing that. I only wanted to remind us of the cases that are not covered by the OpenSpending Packager.

pwalsh commented 8 years ago

@jindrichmynarz yes ok.

The Packager is not the only entry point for data into OpenSpending in the wider sense, it is just the only interface for non-technical users. We have clear use cases for ingesting non-CSV data in OS, outside of OBEU, and are working on such pipelines in general, but it is definitely a case of supporting the long tail.

But of course, I agree with you in principle that OBEU can expose LP-ETL to such users. Going back to the original issue though, I'd still suggest that the current performance would likely be somewhat surprising to such users.

jindrichmynarz commented 8 years ago

I think we should go ahead and discuss the expected performance given the typical workload as @jakubklimek suggested above. His comment also gives background on why performance was not a priority in the development of LP-ETL so far.

However, even if we improve the performance of LP-ETL, it will still be possible to develop slow pipelines. For example, SPARQL queries may perform unnecessary joins. When a performance problem is encountered, the first step should thus always be to review the implementation of the pipeline. Once it is clear that the implementation is sound, we can proceed to isolate the root cause of the bad performance and do something about it.

While we discuss the performance requirements on LP-ETL, I think @marek-dudas can investigate a bit whether the FDP2RDF pipeline does some unnecessary work. When that is ruled out, we should examine the performance profile of the pipeline to see where improvements are needed the most.

marek-dudas commented 8 years ago

I think that there are some redundant operations in the pipeline, e.g. running a query over all input data instead of just the descriptor without the CSV. I will look into it. Also, something like if/else nodes in LP-ETL might help. @jakubklimek, is there or will there be anything like that? What I mean is running a component only if some condition is true.

jindrichmynarz commented 8 years ago

You can usually express the condition as a UNION in SPARQL in which both clauses start with the condition, to reduce the number of bindings quickly.
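A minimal sketch of the idea (placeholder ex: vocabulary, not the actual pipeline queries): each UNION branch starts with its guard pattern, so a branch whose condition does not match contributes no bindings and the rest of its patterns is never expanded over the data:

```sparql
PREFIX ex:   <http://example.com/ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT { ?concept skos:broader ?parent . }
WHERE {
  {
    # "if" branch: evaluated only when the descriptor declares a hierarchy
    ?descriptor ex:hasHierarchy true .
    ?concept ex:parentConcept ?parent .
  }
  UNION
  {
    # "else" branch: guarded the same way by its own condition pattern
    ?descriptor ex:hasHierarchy false .
    ?concept ex:defaultParent ?parent .
  }
}
```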

jakubklimek commented 8 years ago

like if/else nodes in LP-ETL

@marek-dudas Would you specify a use case? This can usually be done in another way.

marek-dudas commented 8 years ago

For example, there is a series of queries dealing with hierarchical classifications. I would like to run those only if there is a triple with fdp:parent as a predicate. Considering what @jindrichmynarz suggests, maybe a similar result can be achieved by simply adding e.g. ?a fdp:parent ?b . at the beginning of each query, to minimize its runtime? But this still won't prevent some queries from going through all of the possibly hundreds of MBs of CSV data, even though fdp:parent is always in the few-kB descriptor data: for example, the query that looks at each "row" from the CSV and links concepts with skos:broader to form the hierarchy has both the descriptor and the CSV on its input, and will thus probably still take quite some time to process even with the ?a fdp:parent ?b . line at the beginning. I will be glad for any further tips, I still don't know many details about SPARQL query processing.

jindrichmynarz commented 8 years ago

similar result can be achieved by simply adding e.g. ?a fdp:parent ?b . at the beginning of each query, to minimize its runtime

Does it reduce the number of bindings, or is the number the same as for the graph pattern without ?a fdp:parent ?b .? This is a good rule: put more discriminating triple patterns first (although the query optimizer may do that for you).
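In other words (the fdp: and ex: namespace URIs below are placeholders), the guard that matches only the few-kB descriptor should come before the pattern that touches every CSV row, so that if no fdp:parent triple exists the engine never enumerates the rows:

```sparql
PREFIX fdp: <http://example.com/fdp#>   # placeholder namespace URI
PREFIX ex:  <http://example.com/ns#>    # placeholder row vocabulary

SELECT ?row ?parentRow
WHERE {
  # guard: matches only the small descriptor data
  ?column fdp:parent ?parentColumn .
  # evaluated only for bindings that survived the guard
  ?row       ex:valueOf ?column .
  ?parentRow ex:valueOf ?parentColumn .
}
```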

I will be glad for any further tips, I still don't know many details about SPARQL query processing.

It would be best if we worked from concrete examples of SPARQL queries whose performance you are unsure about.

marek-dudas commented 8 years ago

So I did some preliminary optimization, removing the most obvious redundant computations. We are still in the area of hours to process 20MB (apparently my first guess was a big underestimate). I still think that the pipeline would benefit a lot from a way to decide whether to run a query based on another query. Even if the triple pattern that determines that the rest of the query does not need to be resolved is at the beginning of the query, it still takes something like 5 minutes to process the 20MB CSV, which the query needs to have as input.

jindrichmynarz commented 8 years ago

Instead of the conditionals of imperative programming languages, LP-ETL is closer to dataflow programming. I can think of two ways to achieve conditional branching in LP-ETL. The first option is to fork a component's output as input to several SPARQL components, each of which determines via its graph pattern whether its transformation should be applied, and then merge their results. The second option is to pass generated configuration to SPARQL components. You can have a SPARQL CONSTRUCT that generates a configuration for another SPARQL component based on its input. If the input doesn't match the CONSTRUCT's graph pattern, an empty configuration is produced, which effectively functions as a no-op.
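A rough sketch of the second option (the cfg: configuration vocabulary below is a placeholder, not LP-ETL's actual one): the CONSTRUCT emits a configuration only when its guard matches, so a non-matching input yields an empty configuration and the downstream SPARQL component effectively does nothing:

```sparql
PREFIX fdp: <http://example.com/fdp#>     # placeholder namespace URI
PREFIX cfg: <http://example.com/config#>  # placeholder configuration vocabulary

CONSTRUCT {
  [] a cfg:Configuration ;
     cfg:query """
       CONSTRUCT { ?concept <http://www.w3.org/2004/02/skos/core#broader> ?parent . }
       WHERE     { ?concept <http://example.com/fdp#parentConcept> ?parent . }
     """ .
}
WHERE {
  # a configuration is produced only when the descriptor declares a hierarchy
  FILTER EXISTS { ?column fdp:parent ?parentColumn . }
}
```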

pwalsh commented 8 years ago

@marek-dudas @jakubklimek @jindrichmynarz can we start with some more deterministic profiling here, please? Marek's estimates are wildly different, it would seem, and therefore it is quite hard to know how to even relate to the issue from the outside.

My suggestion would be simple:

then, per fragment and per pipeline:

Does that sound like a reasonable, if very minimal, start?

Also, @marek-dudas, about your last sentence: does that mean that even simply parsing the 20MB CSV into some in-memory data structure takes 5 minutes, or does it mean something else?

jindrichmynarz commented 8 years ago

I agree. As I suggested in today's call, I think we should have an idea of the memory requirements of the pipeline. Based on experiments with files of different sizes, we should have a rough idea of the relationship between the processed file's size and the memory footprint of the pipeline (e.g., if we process X MB, then we need Y MB, and Y = n * X in the better case, or Y = X^n in the worse case). The memory profiling should be doable using the free VisualVM.

pwalsh commented 8 years ago

@jindrichmynarz interesting comment about dataflow programming! But, reading all of the above, I think @marek-dudas is chasing conditionals because they would allow him to minimise the objects stored in memory (or, minimize intermediate state representations in general), which I am guessing is contributing to the slow performance (but let's see the data :) ).

As @jakubklimek stated earlier, this may in fact be an unavoidable aspect of the design of the framework itself - creating triples in memory.

Dataflow programming suggests great things like working on continuous streams, easy parallelisation, no shared state, and so forth, but from all written above, it seems like that is a step away from the LP-ETL paradigm, which needs to make many representations of the full dataset, in memory, on the way to the final output of the pipeline.

jakubklimek commented 8 years ago

Actually, rather than profiling, which will only confirm that the current performance is bad, I would focus on refactoring the pipeline, which seems way too complicated at the moment. If it is in fact necessary to have such a complicated pipeline, then we could go back to @marek-dudas's original suggestion to implement the transformation in Java as a single component, which would save the SPARQL and data-copying overhead.

pwalsh commented 8 years ago

I can't see why profiling is not a necessary step right now. Of course it will confirm that current performance is bad, but it will provide a baseline for analysis, and setting up some off-the-shelf profiler, as Jindrich suggests, would, I imagine, be a rather simple process.

jakubklimek commented 8 years ago

Well, I would use a profiler when I wanted to see what the problem is. Since it is quite clear what the problem is, I don't see the point, except for getting more precise numbers on memory consumption and runtime. But those numbers will not help me with the optimizations; they are only informative and would be useful only if we then incrementally improved something and wanted to see the improvement reflected in those numbers.

Moreover, there are only two possible courses of action I can see, depending on the expected runtime and memory consumption.

  1. Go ahead and optimize the pipeline. But this way we will never make the memory consumption and runtime comparable to text processing of a CSV file, as the loading of triples in each step is simply unavoidable => it will not work for larger files even if the pipeline had only a few components, which is unrealistic. This is simple math, as I wrote above. An in-memory RDF representation of a table is much larger than in a relational database, and of course much larger than transforming the file as text, row by row. Even if we added support for streamed processing of tabular data to the LP-ETL framework (and to all components used in the pipeline), we would have to rewrite the pipeline to take advantage of this optimization, as it would not help much in its current state because of its complexity and the queries executed in it.
  2. Rewrite the transformation completely as a single component (or even a standalone transformer) in Java (which @marek-dudas said would be much easier) and focus on streamed, text-based processing there, reading the input CSV row by row, transforming it using the JSON descriptor, and outputting triples as text. We would lose the advantages of seeing the intermediate results, using SPARQL functions, and the debugging capabilities of LP-ETL, but we would gain performance.

This is why we need to decide what the requirements are and choose the appropriate course of action; and since I have a feeling we are heading towards 2, I think profiling the pipeline now is not necessary.

jindrichmynarz commented 8 years ago

I agree that profiling in the sense of memory consumption and CPU load would be more useful later, when the pipeline is more or less ready, so that we know what hardware is needed if we want to process files of a given size. However, profiling at this point would help estimate the complexity of the pipeline. If we see linear growth in consumed resources corresponding to the file size, that should be OK; however, if the growth is exponential, there might be a problem in the pipeline's implementation (e.g., wasteful joins in SPARQL queries).

Moreover, even at this stage it would help to do "profiling" with what is provided directly by LP-ETL, which is the sizes of the inputs and outputs of the pipeline's components. The sizes are a bit hidden in the current version of LP-ETL (cf. linkedpipes/etl#219); however, it is still possible to go to the FTP browser and see the sizes there. If there is a component that makes the processed data grow significantly, it may be the culprit contributing to the poor performance of the pipeline.

I'm not sure we need to implement the pipeline in Java to improve performance. For example, yyz1989/NoSPA-RDF-Data-Cube-Validator had the same motivation: SPARQL queries for some DCV integrity constraints were slow, so it reimplemented them in Java. I ran into the same problem with the DCV validation pipeline fragment, so I rewrote the problematic queries to be generated based on the input data, which achieved the same runtime as the NoSPA RDF Data Cube Validator. I think the lesson learnt here is that two queries, one exploring the input data and the other processing the input, can frequently be much faster than a single generic query.
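A sketch of that two-query pattern (placeholder vocabulary, not the actual DCV or FDP2RDF queries): a cheap first query inspects the data, and its results are used to generate a second, specific query instead of running a single generic one over everything:

```sparql
PREFIX fdp:  <http://example.com/fdp#>   # placeholder namespace URI
PREFIX ex:   <http://example.com/ns#>    # placeholder vocabulary
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# Query 1 (run first): cheap exploration over the small descriptor data
SELECT DISTINCT ?column WHERE {
  ?column fdp:parent ?parentColumn .
}

# Query 2 (generated from query 1's results): processes the bulk data with
# the concrete column IRI baked in, instead of re-joining the descriptor
CONSTRUCT { ?concept skos:broader ?parent . }
WHERE {
  ?concept ex:fromColumn <http://example.com/columns/economic-classification> ;
           ex:parentConcept ?parent .
}
```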

As we discussed above in this thread, there are several performance optimizations that can be done with the current LP-ETL. One of them is merging SPARQL Updates to avoid copying data. However, the pipeline uses mostly SPARQL CONSTRUCT queries instead of SPARQL Updates, so applying this optimization would require a partial reimplementation of the pipeline.

pwalsh commented 8 years ago

@jakubklimek ok then, I would have thought the preference is for your option 1, as, let's be clear, the example provided by @jindrichmynarz at the top of this thread shows that this is not a problem specific to the FDP to RDF pipeline. If you want to move towards 2, then I agree that profiling is not so important.

As for deciding the requirements:

Well, as I see it, the requirements have always been quite clear: WP4 uses work from other WPs to provide a live, integrated system, and therefore, at a very high level, we can say that performance has to be related to the common properties of such systems - multiple requests to a pipeline, usage of the results of the pipeline by end users with minimal surprise, and so on.

Setting a requirement based on file size is something we can do I guess, but it is somewhat arbitrary, as, like you said earlier, file size is only one variable among many.

I recommend that you, and @marek-dudas meet with @HimmelStein @mlukasch and @badmotor, who represent the WP4 lead team, and make a decision together based on what is achievable and reasonable. Does that sound good to you?


If it helps, I can give an example of how we designed for similar use cases in OpenSpending components that are not OBEU specific: we worked with pilot partners and the history of knowledge we already have at OKI for how fiscal data looks in the wild. We understood that for end users, there is little to no concept of the difference between a 1MB, a 60MB or even a 500MB file of data, and therefore we made technological decisions to reduce friction for users, by processing streams of data from raw sources, and writing it in chunks to derived databases (Postgres, Elasticsearch, etc.).

jakubklimek commented 8 years ago

@jindrichmynarz If you take a look at the pipeline, there are not many SPARQL Updates to merge. Most of the components are SPARQL CONSTRUCTs, and complicated ones with OPTIONALs, VALUES and nested queries, which are all known to reduce performance.

Also, the comparison with the RDF Data Cube validator is not a good one, because there the input is an RDF data cube (which cannot be processed row by row and needs to be queried over the entire dataset), whereas here the input is CSV (which can be efficiently processed row by row and does not need to be queried over the entire file), and therefore the implementation in Java can be much more efficient than a pipeline in LP-ETL, without the need to load more than a single CSV row + the JSON descriptor into memory at once.

@pwalsh The preference was for option 1 before there were performance requirements. With these requirements in place now (and the time given for development), I am starting to prefer option 2 for FDP2RDF, leaving LP-ETL for the manually created pipelines, where its capabilities are better used and there is more time to implement the possible optimizations, without delaying the integration.

jindrichmynarz commented 8 years ago

If you take a look at the pipeline, there are not many SPARQL Updates to merge. Most of the components are SPARQL CONSTRUCTs

Yes, that is what I mentioned in my comment. However, I think that reimplementing SPARQL CONSTRUCTs as SPARQL Update is possible with some effort.
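For an individual component, the mechanical part of such a rewrite can be small; a sketch with made-up patterns:

```sparql
PREFIX ex:   <http://example.com/ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# CONSTRUCT version: produces a new output graph that the next
# component receives as its own copy of the data
#   CONSTRUCT { ?concept skos:prefLabel ?label . }
#   WHERE     { ?concept ex:rawLabel ?label . }

# SPARQL Update version of the same step: modifies the working data
# in place, so several such operations can be merged into one
# semicolon-separated request as discussed above
INSERT { ?concept skos:prefLabel ?label . }
WHERE  { ?concept ex:rawLabel ?label . }
```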

Also, the comparison with the RDF Data Cube validator is not a good one, because there the input is an RDF data cube (which cannot be processed row by row and needs to be queried over the entire dataset), whereas here the input is CSV (which can be efficiently processed row by row and does not need to be queried over the entire file)

Aren't there foreign keys in FDP?

jakubklimek commented 8 years ago

Yes, that is what I mentioned in my comment. However, I think that reimplementing SPARQL CONSTRUCTs as SPARQL Update is possible with some effort.

I agree. The question is the amount of effort and the performance gain.

Aren't there foreign keys in FDP?

This is something to discuss with @marek-dudas, but usually foreign keys can be handled by consistent URI generation. And even if not, the complexity still does not compare to loading the entire dataset into memory in its RDF representation.

All I am saying is that if performance is a critical factor, this is not a good use case for the current LP-ETL pipeline. Additionally, the effort spent on optimizing LP-ETL and the pipeline is probably going to be much greater, with less effect, than the effort needed to reimplement the transformation in Java (generating RDF dumps in a text-based manner). But this is up to @marek-dudas to say.

pwalsh commented 8 years ago

@jindrichmynarz yes, there are foreign keys in the spec, but as we've communicated elsewhere, they are not really used in actual Fiscal Data Package data as of yet ( @akariv can confirm ). FKs do not necessarily contradict row-wise processing; it just means you might design iterations around building lookup tables for FKs. It definitely does not mean the whole dataset needs to be held in memory.

jindrichmynarz commented 8 years ago

FKs do not necessarily contradict row-wise processing; it just means you might design iterations around building lookup tables for FKs. It definitely does not mean the whole dataset needs to be held in memory.

Yes. I only mentioned foreign keys as an example of something that doesn't allow processing each row in isolation.

marek-dudas commented 8 years ago

I agree with @jakubklimek. If we want FDP2RDF to be fast, the safest choice is to reimplement it by hard-coding it in Java. It should be doable in a very short time, say one week including preliminary testing and debugging. Although I hate the idea of throwing away months of work and I really like LP-ETL, I think that even with the best refactoring and optimization effort, we won't get below (many) hours for 100+ MB of input. If we want to be sure, I could try rebuilding the pipeline with UPDATE queries (instead of CONSTRUCTs), which I originally planned to do at some point anyway. It seems that a union component node, which just merges data from several other nodes, can take something like 8 minutes, which is quite surprising to me, so I think there is potential for a significant speed-up. So if you want a fast FDPtoRDF transformation soon, go with Java right now. (And notice I am not saying "I told you so right at the beginning" :-) If we can afford some experimentation time (at least a week), let's try UPDATE first.

A bit unrelated, but I think it is now becoming relevant information: I was considering prolonging my vacation, which was supposed to be just from today to Sunday, for the whole of next week. Now I am not sure if I can afford it...

pwalsh commented 8 years ago

@marek-dudas @jakubklimek @jindrichmynarz @badmotor @HimmelStein @mlukasch

All good! But let's not keep casting this as an FDP2RDF issue: The first comment in this thread by @jindrichmynarz clearly shows the performance issues are not specific to this pipeline.

jakubklimek commented 8 years ago

@marek-dudas I think you can prolong the vacation safely. I can imagine the reimplementation can still be an LP-ETL component (to leverage the loading and downloading components etc.), so this does not really affect the integration and we can still work on it even after the prototype deadline, which is in September, right?

@pwalsh The FDP2RDF-specific part of the issue is the probably unnecessary pipeline (and query) complexity. The LP-ETL part is the missing support for streamed processing of tabular data, which is only one of many use cases. Support for streamed processing would help with @jindrichmynarz's pipeline from the initial comment, where memory consumption is the main issue and we can afford a longer runtime for the pipeline, as manually created pipelines are not likely to be executed so often. The SPARQL Update component merging, as well as a non-debug run mode, would help with the runtime a bit. However, I think none of these optimizations can significantly help the FDP2RDF pipeline in its current form (it is currently not compatible with streamed processing, and the main bottleneck is the queries processed by the triplestore, not the debugging functionality).

schmaluk commented 8 years ago

@marek-dudas How long would it take to complete the change, as a rough estimate? I don't see other tasks blocked by this except the testing of the pipeline, so you would not need to skip your vacation. There is already OBEU data which we can use, so we don't necessarily depend on the OBEU data produced by the FDP-to-RDF pipeline for other development tasks.

@pwalsh I think if we can handle the processing of one file of a common file size, we can later adjust for simultaneous load from multiple users by having multiple LinkedPipes instances with a load balancer distributing the load. This can be done later, after the prototype.

pwalsh commented 8 years ago

@marek-dudas enjoy your vacation! Health and life before work :).

@jakubklimek the software engineer in me is crying when you say we can "afford a longer runtime" in reference to a pipeline that takes "several days" to process 1GB of output turtle data, but there is no point in arguing over that I suppose, and I will just accept that this is considered reasonable performance for the technologies employed there.

@mlukasch happy for anything you decide! definitely we do not need to rush anything for the prototype, that is agreed.

marek-dudas commented 8 years ago

I should be able to reimplement the FDP2RDF before the end of August.

HimmelStein commented 8 years ago

No matter how large a CSV is, it only has two parts: one header row and the content rows. Given a very large CSV file, we create a first CSV file consisting of just the header row and the first ten content rows, and feed it into the pipeline to obtain the first RDF file. Then we construct the second CSV file from the header row and the next 10 rows (rows 11-20) and feed it into the pipeline, which will run faster than the first chunk, as some information (e.g. the DSD) is already there. Finally, we concatenate all the transformed code lists and RDF datasets together. We can say the processing time is then almost proportional to the size of the CSV file.

jakubklimek commented 8 years ago

@HimmelStein This will take care of the memory problem, but I suspect the time to process will be even longer. Let's see what @marek-dudas comes up with.

schmaluk commented 8 years ago

I have talked with Fabrizio as well. It might be good to agree on some performance goals for the pipeline in advance, as orientation, since this can influence some implementation decisions. (Which we should have communicated in advance anyway... sorry about this!!) If the pipeline can handle an FDP file of a common OpenSpending size (100MB) within a few hours (maybe <4 hours), that would be sufficient. In any case, thanks a lot for your hard work here.

marek-dudas commented 8 years ago

Just committed a new version of the pipeline which uses a new FDPtoRDF LinkedPipes component that processes the CSV directly in Java. It is much, much faster now. The only known issue is that support for multiple CSVs in one datapackage was dropped, which is AFAIK not yet supported by OpenSpending itself. It will be implemented later; I just wanted to save time at the moment.

HimmelStein commented 8 years ago

great! hope it works well this time.

marek-dudas commented 7 years ago

Since the current performance of the pipeline (about 7 minutes for a 700MB CSV) seems to be acceptable, I am closing this issue. Reopen if anyone disagrees.