raulcf / SEEPng


HDFS Integration #68

Closed WJCIV closed 8 years ago

WJCIV commented 8 years ago

Created a couple of new settings in FileConfig to let the user specify that files reside on HDFS. Both read (source) and write (sink) capabilities are implemented.

WJCIV commented 8 years ago

> Why not use a single IOException for all the code?

The nested one there performs a different function. There is no option to create-or-append: you must either create (which overwrites if the file exists) or append (which fails if the file does not exist). Traditional "clean" code would check whether the file exists and then create or append as appropriate, but that leaves a potential race condition if another process creates the file between your check and your create. By attempting the create first, catching the exception, and appending in that case, we get rid of most of the races.
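The pattern is easier to see in code. Here is a minimal sketch of the same create-then-fall-back-to-append idea, using plain `java.nio` on a local file system rather than the actual HDFS `FileSystem` API (the helper name and exact flags are illustrative assumptions, not the code in this PR):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CreateOrAppend {
    // Hypothetical helper illustrating the race-free pattern described above.
    static void writeOrAppend(Path file, String line) throws IOException {
        byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
        try {
            // Attempt an exclusive create first: this fails if the file already
            // exists, instead of racing on a separate exists() check.
            Files.write(file, bytes, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        } catch (FileAlreadyExistsException e) {
            // The file was there (perhaps created by another process): append.
            Files.write(file, bytes, StandardOpenOption.APPEND);
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempDirectory("demo").resolve("out.txt");
        writeOrAppend(f, "first\n");   // file does not exist yet: exclusive create
        writeOrAppend(f, "second\n");  // file now exists: falls through to append
        System.out.println(Files.readAllLines(f)); // [first, second]
    }
}
```

The key point is that the create/append decision is made atomically by the filesystem call itself, not by a separate existence check that can go stale.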

WJCIV commented 8 years ago

> Not sure if we need both the hdfs.source AND the hdfs.uri variables. We could simply assume if hdfs.uri is set that we are talking about an HDFS file.

That's certainly an option, and one I considered. I just thought this was a little more consistent with the interface we provide for other options. Since this is user-facing, consistency might be more important than a minor space-efficiency concern.
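To make the trade-off concrete, here is a hypothetical sketch of how the two settings might be read from a `Properties` object. The property names follow the comment above, but the reading code is an assumption for illustration, not SEEPng's actual FileConfig:

```java
import java.net.URI;
import java.util.Properties;

public class HdfsConfigSketch {
    public static void main(String[] args) {
        Properties p = new Properties();
        // Option in this PR: an explicit flag plus the URI...
        p.setProperty("hdfs.source", "true");
        p.setProperty("hdfs.uri", "hdfs://namenode:9000/data/input.txt");

        // ...versus the alternative of inferring HDFS from hdfs.uri alone.
        boolean explicitFlag = Boolean.parseBoolean(p.getProperty("hdfs.source", "false"));
        boolean inferredFromUri = p.getProperty("hdfs.uri") != null;
        System.out.println(explicitFlag + " " + inferredFromUri); // true true

        URI uri = URI.create(p.getProperty("hdfs.uri"));
        System.out.println(uri.getScheme()); // hdfs
    }
}
```

Either shape works; the explicit flag just mirrors how the other FileConfig options are exposed.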

raulcf commented 8 years ago

Has this pull request been tested in an HDFS deployment? It'd be great to have a working deployment where we can put all the datasets we want to use and test all this.

The other thing: please make sure this does not break the previous examples that were using FileSource or TextFileSource.

As soon as all this is done I'm merging!

pgaref commented 8 years ago

Actually, we do have an HDFS test cluster; @WJCIV set it up and I think he is using it for development as well. I could create an extra query example that reads from some real datasets to test HDFS too.

WJCIV commented 8 years ago

We don't have a dedicated HDFS cluster, but it takes 20 minutes to set one up on the wombats (and then 10 seconds to turn it on and off). We'll be able to run the tests we want, but it won't do for long-term storage.

WJCIV commented 8 years ago

I tested text input/output during the original run. Now I've tried binary, and there appears to be some issue: I can output something in binary mode, but it is not read back in successfully. I'm not sure yet whether the output format is a bit off or the input code has an issue.

WJCIV commented 8 years ago

Okay, I need another set of eyes on this. The problem is that this code cannot read binary input. However, that's not new to this checkin: I've tried the master branch on LSDS and it has the same problem. I'm not sure what has changed since last month to affect this. From reading the code, it might be because there is no file header stating the number of tuples, but that's not a new addition either.

My test code reads strings from a file, converts them to upper case, and prints them out (minor changes to the simple-sourcefile example). To test binary input/output, I write binary output in one test, then read that file back in the next. The output from this checkin looks the same as what the master branch generates.
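SEEPng's actual on-disk binary format isn't shown in this thread, but the suspected failure mode above is easy to demonstrate in a minimal, hypothetical sketch: without a tuple-count header (or some other framing), a binary reader has no way to tell where the stream of tuples ends. The class and method names here are illustrative assumptions:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BinaryTuples {
    // Write a tuple-count header followed by the records themselves.
    static byte[] write(List<String> tuples) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(tuples.size());             // header: number of tuples
        for (String t : tuples) out.writeUTF(t); // writeUTF length-prefixes each record
        return buf.toByteArray();
    }

    // The reader relies on that header to know how many records to consume.
    // Drop the header and this loop has no termination condition -- the kind
    // of symptom described in the comment above.
    static List<String> read(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int n = in.readInt();
        List<String> tuples = new ArrayList<>();
        for (int i = 0; i < n; i++) tuples.add(in.readUTF());
        return tuples;
    }

    public static void main(String[] args) throws IOException {
        List<String> tuples = List.of("hello", "world");
        System.out.println(read(write(tuples))); // [hello, world]
    }
}
```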

pgaref commented 8 years ago

I could take a look later today. Do you mind creating a branch with all the test code?

Cheers

WJCIV commented 8 years ago

Done. Called it TestCode. It doesn't have the HDFS stuff, but that's just a few more FileConfig properties in Base.java.

raulcf commented 8 years ago

Let me know when you guys have figured this out :)

pgaref commented 8 years ago

@WJCIV After a closer look at the plain FileReader (reading a String as schema) example using a MarkerSink:

Will continue debugging tomorrow. I am also opening an issue regarding the File OutputBuffer Size we discussed.

WJCIV commented 8 years ago

It looks to me as though the input behaves the same on HDFS as on a local file system, so the fact that the input is not working is orthogonal to this branch. I'm going to work on a separate branch to try to fix that.

WJCIV commented 8 years ago

We've explained the input and output problems, and they are not related to these changes. This should be good to merge.

raulcf commented 8 years ago

done, thanks!