waynebhayes / SANA

Simulated Annealing Network Aligner

Made all graph loading compatible with compressed files; created a utility for easily obtaining file pointers and streams for compressed files. #71

Closed rasulsafa closed 5 years ago

rasulsafa commented 5 years ago

Prior to this pull request, SANA had only rudimentary support for reading compressed files: it was used only when loading Similarity Matrix files. To make the existing SANA code compatible with compression, I have created two helper functions in utils: FILE* readFileAsFilePointer(const string& fileName, bool& piped) and stdiobuf readFileAsStreamBuffer(const string& fileName).

The first obtains a C-style FILE* for the given file name. If the file is uncompressed, it returns a FILE* generated by fopen. If the file is compressed, it returns a FILE* generated by popen. It also sets the passed-in boolean reference to true if the returned FILE* is piped (that is, generated by popen). Whether a FILE* came from popen or fopen matters because it determines whether we must close it later with pclose or fclose. I've created a simple helper function, void closeFile(FILE* fp, const bool& isPiped), that closes the FILE* with the correct function given the file pointer and the boolean isPiped.

Example usage:

bool isPiped = false; // set to true by readFileAsFilePointer if the file is read through a pipe
FILE* infile = readFileAsFilePointer("myfile.gz", isPiped); // pass in compressed or uncompressed file
fscanf(infile, ...);
closeFile(infile, isPiped);
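
For readers unfamiliar with the fopen/popen split, here is a minimal sketch of how such a helper pair might look on a POSIX system. It is not SANA's actual implementation: the names openMaybeCompressedSketch, closeFileSketch, and endsWith are hypothetical, and the mapping from file extension to command (gzip -dc, bzip2 -dc, xzcat) is an assumption made for illustration.

```cpp
#include <cassert>
#include <stdio.h>   // FILE, fopen, fclose; popen and pclose are POSIX extensions
#include <string>

// Hypothetical helper: true if `name` ends with `suffix`.
static bool endsWith(const std::string& name, const std::string& suffix) {
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Sketch of the idea behind readFileAsFilePointer (assumed behaviour, not SANA's code).
// `piped` is set to true only when the returned FILE* comes from popen.
FILE* openMaybeCompressedSketch(const std::string& fileName, bool& piped) {
    std::string cmd;
    if (endsWith(fileName, ".gz"))       cmd = "gzip -dc";
    else if (endsWith(fileName, ".bz2")) cmd = "bzip2 -dc";
    else if (endsWith(fileName, ".xz"))  cmd = "xzcat";

    piped = !cmd.empty();
    if (!piped) return fopen(fileName.c_str(), "r");

    // Naive quoting: assumes the file name contains no single quotes.
    cmd += " '" + fileName + "'";
    return popen(cmd.c_str(), "r");
}

// Sketch of closeFile: use the close function that matches how fp was opened.
int closeFileSketch(FILE* fp, bool isPiped) {
    assert(fp != nullptr);  // closing a null FILE* is a caller bug
    return isPiped ? pclose(fp) : fclose(fp);
}
```

Keeping the flag next to the FILE* is what makes the later close safe: calling fclose on a pipe opened by popen (or pclose on a regular file) is not valid.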

The method described above works well in places like ExternalSimMatrix, where C-style file I/O is used, but most of SANA uses C++-style streams. To make compression compatible with streams, I had to convert the FILE* generated by popen or fopen into a stream. This required a class called stdiobuf, which creates a stream buffer from a FILE*; the buffer is then passed to the constructor of an istream. An istream, unlike an ifstream, can be constructed directly from a buffer. (istream is the base class of ifstream, and the two have almost identical functionality.) So from now on, instead of doing ifstream infile("myfile.el"), we do

stdiobuf sbuf = readFileAsStreamBuffer(fileName); // pass in compressed or uncompressed file
istream infile(&sbuf);
string line;
getline(infile, line);
...

Note that close() isn't called on the stream; closing is handled entirely in the destructor of stdiobuf, so the file is closed when the buffer goes out of scope or delete is called on it.
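
To illustrate the general technique, here is a simplified sketch of a read-only stream buffer that wraps a FILE* and closes it in its destructor. This is not the stdiobuf class from this pull request: the class name FileStreamBufSketch, its constructor signature, and the 4096-byte buffer size are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdio.h>    // FILE, fread, fclose; pclose is a POSIX extension
#include <istream>
#include <streambuf>
#include <string>

// Hypothetical, simplified stand-in for a FILE*-backed stream buffer.
class FileStreamBufSketch : public std::streambuf {
public:
    FileStreamBufSketch(FILE* fp, bool piped) : fp_(fp), piped_(piped) {
        assert(fp_ != nullptr);
    }

    // The destructor closes the file, so the owning code never calls close().
    ~FileStreamBufSketch() override {
        if (piped_) pclose(fp_);
        else        fclose(fp_);
    }

    // Non-copyable: two copies would both try to close the same FILE*.
    FileStreamBufSketch(const FileStreamBufSketch&) = delete;
    FileStreamBufSketch& operator=(const FileStreamBufSketch&) = delete;

protected:
    // Called by the istream when the get area is exhausted: refill it from fp_.
    int_type underflow() override {
        if (gptr() < egptr()) return traits_type::to_int_type(*gptr());
        std::size_t n = fread(buf_, 1, sizeof buf_, fp_);
        if (n == 0) return traits_type::eof();
        setg(buf_, buf_, buf_ + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    FILE* fp_;
    bool  piped_;
    char  buf_[4096];
};
```

An istream built on such a buffer (std::istream infile(&sbuf);) supports getline and operator>> as usual, but has no open(), close(), or is_open() members, which is one place where code written for ifstream needs adjusting.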

Some other utilities I've made are string getDecompressionProgram(const string& fileName), which returns the decompression program for a given file, and string getUncompressedFileExtension(const string& fileName), which returns el for something like AThaliana.el.gz.
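
As a rough sketch of what these two utilities do, the functions below map a file name to a decompression program and strip the compression suffix before taking the extension. The function names, the extension table (.gz, .xz, .bz2), and the choice to return an empty string for unrecognised or missing extensions are assumptions, not the behaviour of SANA's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `name` ends with `suffix`.
static bool hasSuffix(const std::string& name, const std::string& suffix) {
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Sketch of getDecompressionProgram: "" means "not a recognised compressed file".
std::string decompressionProgramSketch(const std::string& fileName) {
    if (hasSuffix(fileName, ".gz"))  return "gzip";
    if (hasSuffix(fileName, ".xz"))  return "xzcat";
    if (hasSuffix(fileName, ".bz2")) return "bzip2";
    return "";
}

// Sketch of getUncompressedFileExtension: "AThaliana.el.gz" gives "el".
std::string uncompressedExtensionSketch(const std::string& fileName) {
    std::string name = fileName;
    if (!decompressionProgramSketch(name).empty()) {
        std::size_t cut = name.rfind('.');
        assert(cut != std::string::npos);  // guaranteed by the suffix match above
        name.erase(cut);                   // drop the compression suffix
    }
    std::size_t dot   = name.rfind('.');
    std::size_t slash = name.find_last_of('/');
    // No extension if there is no dot, or the last dot belongs to a directory name.
    if (dot == std::string::npos ||
        (slash != std::string::npos && dot < slash)) {
        return "";
    }
    return name.substr(dot + 1);
}
```

With this sketch, an uncompressed name such as yeast.el also yields el, so callers can treat compressed and uncompressed inputs uniformly.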

The decompression programs SANA is compatible with are gzip, xzcat, and bzip2. (I added bzip2)

In this pull request, I've only made graph loading and external sim matrices use the new utilities, since it takes a while to test the functionality after swapping to the new utility (ifstream and istream aren't 100% compatible).

rasulsafa commented 5 years ago

Alright, will do