tfwillems / HipSTR

Genotype and phase short tandem repeats using Illumina whole-genome sequencing data
GNU General Public License v2.0
94 stars 31 forks source link

Reading CRAMs from S3 bucket #96

Open james-guevara opened 5 months ago

james-guevara commented 5 months ago

Hi all,

I’d like to call TRs on CRAMs that are inside S3 buckets. Since htslib allows for reading CRAMs from S3 buckets, I ideally wouldn’t have to download these CRAMs to run HipSTR on them. However, this functionality apparently doesn’t work after compiling the most recent version of HipSTR, so I’m wondering if there are plans to add this functionality (or if I'm doing something wrong).

As of now, I just made some hacky changes to the HipSTR codebase so that I can read from S3 buckets.. (I'm just posting them here in case it is helpful if you do plan on adding this functionality later):

  1. First of all, regardless of what libraries I included when compiling htslib, the htsfile program wouldn’t allow me to read from an S3 bucket. I therefore replaced it with the most recent version of htslib (1.19.1).

  2. I added the -fpermissive flag to the default compilation flags in CXXFLAGS in the HipSTR Makefile (otherwise, I get a compiler error: lib/htslib/cram/cram_io.h:217:33: error: invalid conversion from ‘void*’ to ‘unsigned char*’ [-fpermissive], but I figure that it's safe to relax this type checking by using this flag).

  3. In the HipSTR Makefile, I added-lcurl -lssl -lcrypto to the LIBS (though I suspect these aren’t necessary).

  4. I replaced every bam_hdr_destroy with sam_hdr_destroy in the source code.

  5. I added #include <unistd.h>to src/bam_io.h and src/denovos/denovo_main.cpp

  6. I figure this is an egregious change, but for the sake of a quick fix, I modified the bool file_exists(const std::string& path) method in the src/bam_io.h file so that the program doesn’t error out when the path is to a CRAM inside an S3 bucket:

    
    bool file_exists(const std::string& path){
    // Check if the path starts with "s3://"
    if (path.rfind("s3://", 0) == 0) {
        // For now, just return true to proceed with S3 paths
        return 1;
    }
    return (access(path.c_str(), F_OK) != -1);
    }


Again, these are very hacky changes, but I am able to read from an S3 bucket now and HipSTR works, although I do eventually get an error if I run on all the chromosomes. If I run on just one chromosome, it finishes successfully. So I figure the problem is that the connection just tends to go down before HipSTR runs on all the chromosomes.