mtokuyama / ERVmap

ERVmap is one part curated database of human proviral ERV loci and one part stringent algorithm to determine which ERVs are transcribed in RNA-seq data.
http://www.ervmap.com

installation / usage instructions #1

jgbaum opened this issue 5 years ago

jgbaum commented 5 years ago

Hi,

I run the bioinformatics core at La Jolla Institute for Immunology (lji.org), and one of our researchers is interested in running your tool. I've had a quick look through the repo and was unable to find any documentation on installing or using the code. Please let me know if you can provide this.

Thanks very much,

Jason

messersc commented 5 years ago

Hi Jason,

I am not one of the authors, but I am also looking into how to use the tool. It looks like two Perl scripts that do read trimming with btrim, mapping with bwa mem, and then some custom filtering. I am not sure how easy it would be to get it running on anyone else's machine.

My strategy for the moment will be to use the provided BED file to extract the sequences from hg38 and use this FASTA with salmon to quantify expression. I'm not sure how well this will work, though.
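
A rough sketch of that idea, in case it is useful to anyone (file and sample names are placeholders, bedtools and salmon are assumed to be installed, and ERVmap.chr.bed is the BED file from the repo):

# pull out the ERV loci sequences from hg38, keeping names and strand
bedtools getfasta -fi hg38.fa -bed ERVmap.chr.bed -name -s -fo ERVmap.fa

# build a salmon index over those sequences and quantify one paired-end sample
salmon index -t ERVmap.fa -i ervmap_salmon_index
salmon quant -i ervmap_salmon_index -l A -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -p 8 -o sample_ervmap_quant

A decoy-aware salmon index (with the full genome as decoy) would probably be safer against reads originating elsewhere in the genome, but the above shows the idea.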

Cheers, Clemens

shokohirosue commented 5 years ago

@messersc @jgbaum

Hi Jason and Clemens,

I was also thinking of using the tool and came here. I just wondered how you got on with the analysis. Does it seem to work without their specific custom filtering? Any advice and suggestions would be greatly appreciated.

Thank you so much.

Best, Shoko

LimWChing commented 5 years ago

Hi Jason, Clemens and Shoko,

Glad that there are others trying to run this too. I cloned this repository and decided to try my luck running the commands according to the README page. I managed to get an output from the first step of the paired-end run:

interleaved.pl --read1 ${i}_R1.fastq.gz --read2 ${i}_R2.fastq.gz > ${i}.fastq.gz

So I proceeded to the next step, calling erv_se_genome_v2.pl, but it returned the help info instead. I guess we need to provide all the options to run the script? Any ideas on this?

Also, for the --filter option it says to give the path to the parse_bam.pl file, but I do not see it within this directory. Does anybody know where to get this file?

Thank you.

Best, Wan Ching

bounlu commented 4 years ago

Hi all,

After a couple of painful days, I made this tool work for our samples. I had to debug and modify the scripts a lot to make it work, though. The code is not written in a generalized way and uses older/deprecated versions of the tools.

Note that TopHat2 didn't work in my case: I got a segmentation fault (core dumped) error in the tophat_reports step (as discussed here). So for the second alignment part (gene expression quantification) I switched from TopHat2 to STAR, which is a far better aligner anyway.

Also note that the authors say they use DESeq2 in the methods of the paper, but the script actually uses DESeq, the older version.

@jgbaum you just download the GitHub directory and unzip it to a folder, then add this folder to your PATH to make the executables work; no installation is required.
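
Something along these lines (the clone location is arbitrary, and you may need to adjust the path if the .pl scripts sit in a subdirectory of the repo):

# get the code and put the scripts on PATH
git clone https://github.com/mtokuyama/ERVmap.git
chmod +x ERVmap/*.pl
export PATH="$PWD/ERVmap:$PATH"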

@LimWChing you have to comment out the lines starting with pod2usage(1) unless to avoid seeing the help info, provided you are sure you are supplying the right input parameters. Make sure you provide all input parameters even if you start from a mid-stage, and use absolute paths for the files. I see the parse_bam.pl file has been added to the repo, probably after your message.
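
For example, something like this rough GNU sed one-liner does the commenting out (adjust the script name to whichever one you are running; it keeps a .bak backup of the original):

sed -i.bak 's/^[[:space:]]*pod2usage(1) unless/# &/' erv_genome.pl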

Hope these help some people.

Best. Ömer

rained1989 commented 4 years ago

@bounlu Hi bounlu,

How did you run the script 'run_clean_htseq.pl ./output/cellular c c2 __' successfully? There seems to be a bug in this program at line #30. Thanks.

Best, dong

bounlu commented 4 years ago

@rained1989 Hi,

It was a long time ago, so I don't remember, but maybe I can recall if you post the error you get.

Best.

rained1989 commented 4 years ago

@bounlu Hi bounlu,

Thanks. The author fixed the error by adding a new script 'clean_htseq.pl'.

Best.

jcarlosangel commented 4 years ago

@bounlu Hi, I'm having trouble with the TopHat2 alignment. I saw that you replaced this step with STAR. Can you please dig a little bit into this? Which files did you use for this? How did you change the main program to use STAR?

Thank you.

bounlu commented 4 years ago

@jcarlosangel

Hi,

I did the alignment with STAR myself for each sample, with the default parameters. The only output you need is the .bam file. Then I changed the erv_genome.pl script as follows:

my $bam = "Aligned.sortedByCoord.out.bam";  # was "accepted_hits.bam"

This way you can skip the alignment step with TopHat in stage 5 and keep just the BAM indexing:

# tophat
if ($stage <= 5 && $stage2 >= 5) {
#    &cmd($aln_cmd);
    chdir("GRCh38");
    &cmd($bamindex_cmd2);
}

Then you can run it as follows. Note that you still need to give the full input parameters (--tophat, --transcriptome) even if you are not using them so that the script does not complain:

erv_genome.pl --stage 1 --stage2 6 --fastq S1.fastq.gz --btrim btrim64 --tophat tophat2 --bwa bwa --samtools samtools --filter parse_bam.pl --bedtools bedtools --genome hg38.fa --genome_Bowtie2 hg38 --bed ERVmap.chr.bed --genomefile hg38.chrom.sizes --gtf hg38.genes.gtf --transcriptome hg38_tophat --adaptor illumina_adapter.txt --cell S1
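
For anyone repeating this, a rough sketch of the STAR step that produces that BAM (the index directory name and thread count are arbitrary; run the alignment from the directory where the script expects the BAM, since the stage 5 code above indexes Aligned.sortedByCoord.out.bam inside GRCh38):

# one-time genome index, reusing the hg38.fa and hg38.genes.gtf from the command above
STAR --runMode genomeGenerate --runThreadN 8 --genomeDir hg38_star --genomeFastaFiles hg38.fa --sjdbGTFfile hg38.genes.gtf

# per-sample alignment with default parameters; the sorted BAM is the only output the script needs
STAR --runThreadN 8 --genomeDir /path/to/hg38_star --readFilesIn S1.fastq.gz --readFilesCommand zcat --outSAMtype BAM SortedByCoordinate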

jcarlosangel commented 4 years ago

@bounlu Thank you so much, I appreciate your help.

bounlu commented 4 years ago

@hchintalapudi Yes, indeed it is the Perl version of the R code normalize_deseq.r. More simply, it's the equivalent of counts(dds, normalized=T) in DESeq2.

hchintalapudi commented 4 years ago

@bounlu That's a relief! Before reading this thread I had already changed a lot of parameters like you all did: replaced TopHat with STAR, broke the source code up into separate parts, and so on. So I was worried about not following the downstream cleaning scripts. You're saying that dividing the ERV counts by the size factors from the cellular counts (those cellular counts generated from STAR alignments and read counting) is correct, right? And this final ERV matrix can be used for analysis? Thank you.

bounlu commented 4 years ago

@hchintalapudi Ohh yes, you are right: you still need to divide the ERV counts by the size factors you got from DESeq2 on the cellular counts, and that's it for the ERVmap part; please ignore my previous post, which was not clear. Next, what I did was calculate the average fold change between the test and control replicates to define up- and down-regulated ERVs. Then I handed the list to the wet-lab people for validation, back in the era when no corona was around.

hchintalapudi commented 4 years ago

Thank you @bounlu! Your feedback was helpful. I did everything in the pre-corona era too. :D I applied the DESeq function after supplying these size factors to the dds object and then went on with the differential expression analysis.

jcarlosangel commented 4 years ago

Hi @hchintalapudi! So, did you normalize while performing the differential expression analysis with DESeq, or did you use the final default ERVmap counts file for it? I'm asking because when I tried to replicate the results reported in the ERVmap paper there were a lot of inconsistencies! I sincerely don't know why, since I didn't change anything in the core script.

Thank you!

hchintalapudi commented 4 years ago

Hi @jcarlosangel, I used raw ERV counts and supplied the size factors (from cellular genes) in DESeq2 for normalization. Hope that helps!
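
In code this is roughly the following (a sketch only; the count files, sample sheet, and "condition" column are placeholders):

Rscript -e '
  suppressMessages(library(DESeq2))
  erv      <- as.matrix(read.table("erv_counts.txt",      header=TRUE, row.names=1))
  cellular <- as.matrix(read.table("cellular_counts.txt", header=TRUE, row.names=1))
  coldata  <- read.table("samples.txt", header=TRUE, row.names=1)          # must have a "condition" column
  dds <- DESeqDataSetFromMatrix(countData = erv, colData = coldata, design = ~ condition)
  sizeFactors(dds) <- estimateSizeFactorsForMatrix(cellular)               # size factors from cellular genes, not ERVs
  dds <- DESeq(dds)                                                        # DESeq keeps the supplied size factors
  write.table(counts(dds, normalized=TRUE), "erv_counts.normalized.txt", sep="\t", quote=FALSE)
  write.table(as.data.frame(results(dds)),  "erv_deseq2_results.txt",   sep="\t", quote=FALSE)
'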

hafizmtalha commented 2 years ago

@bounlu Coming back to your answer above (replacing the TopHat2 alignment with STAR in erv_genome.pl) after a long time: is there any way I could make TopHat run in the script instead? I don't have enough RAM to run STAR. Is there any other way around this?