mff-uk / odcs

ODCleanStore

The dpu design issue dialog #1324

Open Jan-Marcek opened 10 years ago

Jan-Marcek commented 10 years ago

Some of the dialog functionality is suitable to move into the core, such as uploading and reading from a file.

odcs version: release-1.0.0

skodapetr commented 10 years ago

Do you have any suggestion how this should be realized, from the user's point of view? Should the core provide some easy-to-use component?

But generally I agree that we may move some common (dialog) functionality to a single shared place, for example commons-module.

Jan-Marcek commented 10 years ago

Michal and I have been thinking a bit about how to do this best. It would be reasonable to create a standalone DPU, e.g. an http extractor, that would take the given data and put it on the input of the next DPU, which would then process it. The construction would then look like, for example:

http downloader -> unzipper -> rdf extractor

If I am not mistaken, the unzipper is currently built so that it downloads the data from an http address itself. The goal would be that pipelines like read file from a disk location -> unzipper -> rdf extractor can also be created. It would also be good to make it possible to process a zipped CSV file that some csv extractor would then handle. At the moment only RDF is passed between DPUs.

The second way would be to put these shared functions into commons-module, as you write. With this approach there is no need to deal with transferring CSV or XLS files, since everything would be left to the individual DPUs.

Tomáš could weigh in with what he thinks about this.

skodapetr commented 10 years ago

Ah, I thought you meant a component for the dialog; this changes the situation. For my part: the current ODCS also supports passing files and directories via file data units. The DPU as you described it seems almost ideal for a "workshop", and functionality-wise it does not sound bad to me. It would just have to be designed quite universally and then integrated into the existing DPUs.

ghost commented 10 years ago

Hi all, I am curious whether GitHub allows replying via mail.

I would go further and build something around output/input streams (PipelineInputStream, PipelineOutputStream). At the very beginning there is some data extractor; it would just pour data into the output stream on our transfer object. Our transfer object would then give the next DPU in line an input stream it can read from. For example: an http extractor, behind it an unzipper, behind that a csv reader, behind that a tabular extractor, and from there RDF already flows into the normal DPUs. Then come the transformer DPUs as they are now, and on the other side of the pipeline it is the same with the loaders: a tabular loader just creates an output stream, which goes to some XLS writer, then to bzip2, then to an ftp uploader…

So we would have RDF only within (extractor-transformer-loader), and the fancy things around it would be built on pipelined streams (our transfer object is de facto a pipe stream; there is support for this in Java).

Of course, the first component that cannot work "by chunks" breaks the property that we can more or less load into RDF piece by piece, and will read the whole stream into memory. But nothing can be done about that; some formats simply cannot be streamed (non-optimized PDF, for example; I am not sure about XLS).
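The "support in Java" mentioned above is the `java.io` pipe pair, `PipedInputStream`/`PipedOutputStream`. A minimal sketch of two pipeline stages handing bytes over such a pipe (the class and method names here are illustrative only, not ODCS API):

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.nio.charset.StandardCharsets;

public class PipeDemo {

    // Pushes a payload through a pipe: a producer thread writes to the
    // PipedOutputStream while the calling thread reads the connected
    // PipedInputStream, mimicking two DPUs handing off bytes "by chunks".
    static String pump(String payload) throws IOException, InterruptedException {
        PipedInputStream in = new PipedInputStream();
        PipedOutputStream out = new PipedOutputStream(in); // connect both ends

        // java.io pipes require the writer to run in a separate thread.
        Thread producer = new Thread(() -> {
            try (out) { // closing signals end-of-stream to the reader
                out.write(payload.getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        byte[] buf = in.readAllBytes(); // blocks until the producer closes
        producer.join();
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.print(pump("chunk-1\nchunk-2\n"));
    }
}
```

As the comment notes, this only pays off when every stage can actually process data incrementally; a stage that must see the whole input (e.g. a non-optimized PDF parser) ends up buffering the entire stream anyway.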


tomas-knap commented 10 years ago

I agree that some shared functionality (e.g., a function for unzipping an archive) should be provided in some common utils classes visible to all DPUs. Actually, if you check DataUnitUtils in the commons module, there are already methods for storing a string to a file, reading a file into a string, etc.
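As a rough illustration of such a shared helper, here is a minimal sketch of an unzip utility built on the standard `java.util.zip` package. The `ArchiveUtils` name and signature are hypothetical, not existing ODCS code; only the idea of a static utility visible to all DPUs comes from the comment above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical shared helper in the spirit of DataUnitUtils. */
public final class ArchiveUtils {

    private ArchiveUtils() {}

    /** Unpacks a zip archive into targetDir and returns the extracted files. */
    public static List<Path> unzip(Path archive, Path targetDir) throws IOException {
        List<Path> extracted = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path out = targetDir.resolve(entry.getName()).normalize();
                // Refuse entries that would escape targetDir ("zip slip").
                if (!out.startsWith(targetDir)) {
                    throw new IOException("Blocked path traversal: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    Files.copy(zip, out); // streams the entry straight to disk
                    extracted.add(out);
                }
            }
        }
        return extracted;
    }
}
```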

Regarding the particular example of the extractor: yes, it would be more versatile if we had three DPUs - one for downloading data, one for unzipping data, and one for extracting data. This could be achieved easily (as long as the list of DPUs involved does not grow too long), because we support file data units - exchanging files between DPUs.

Ad streams: this should probably be discussed in a separate issue. I agree that we can get some performance gain from using streams. This can be implemented quite easily for file data units, but it has limited use for RDF data units, as it would mean introducing the notion of "streamable" and "non-streamable" RDF operations. It also gets more complicated when a DPU consumes multiple inputs, etc. So I would postpone streaming to a later phase of the project; it is a complex feature I would not like to introduce now.

ghost commented 10 years ago

Hi,

> Regarding the particular example of extractor, yes, it would be more versatile, if we have three DPUs - one for downloading data, one for unzipping data and one for extracting data. This could be achieved easily (as long as the list of DPUs available is not too long), because we support file data units - exchanging files between DPUs.

You yourself write in the new specification for tabular data that it should be "SAX-like"; if not with streams like this, I do not know how you want to achieve "SAX-like".

> Ad streams, this should be probably discussed in a different issue. I agree that we can get some performance gain while using streams, but this can be implemented quite easily for file data units, but it has limited use for RDF data units, as this would mean to introduce notion of "streamable" and "non-streamable" RDF operation. Also it is more complicated when more inputs are consumed by the DPU etc. So I would postpone streaming to later phases of the project, it is a complex feature I would not like to introduce now

Of course, those streams would flow only between the DPUs that exist before the extractor and after the loader. Inside, the project stays the same.


tomas-knap commented 10 years ago

Michal, so to summarize: you want to have some utils methods available to all DPUs, which would e.g. "download a file from a given HTTP URL and save it to a corresponding local file", in a streaming fashion?
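A streaming download helper of that kind could be sketched as follows. `DownloadUtils` and its signature are hypothetical, not ODCS code; only the idea of streaming the response straight to disk without buffering it in memory comes from the discussion.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical shared download helper; name and signature are illustrative. */
public final class DownloadUtils {

    private DownloadUtils() {}

    /**
     * Streams the content of the URL into the target file without holding
     * it in memory, and returns the number of bytes written.
     */
    public static long download(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            // Files.copy consumes the stream in chunks as it arrives.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Because `URL.openStream()` handles `file:` URLs as well as `http:`, such a helper would also cover the "read file from a disk location" case discussed earlier in the thread.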