jgbaum opened this issue 5 years ago
Hi Jason,
I am not one of the authors, but I am also looking into how to use the tool. It looks like two Perl scripts that do read trimming with btrim, then mapping with bwa mem, followed by some custom filtering. I'm not sure how easy it would be to get it running on anyone else's machine.
My strategy for the moment is to use the provided BED file to extract the sequences from hg38 and quantify expression on this FASTA with salmon. Not sure how well this will work, though.
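In case it helps anyone: the extraction step is just substring slicing with BED's 0-based, half-open coordinates. A toy Python sketch of the idea (stand-in sequences and interval names, not actual hg38; in practice bedtools getfasta does this):

```python
# Toy illustration of BED-based sequence extraction (what `bedtools getfasta` does).
# BED coordinates are 0-based, half-open: [start, end).
genome = {"chr1": "ACGTACGTACGTACGT"}  # hypothetical stand-in for hg38

bed_lines = [
    "chr1\t0\t4\tERV_1",   # made-up interval names
    "chr1\t8\t12\tERV_2",
]

def extract(genome, bed_lines):
    """Return (name, sequence) pairs for each BED interval."""
    records = []
    for line in bed_lines:
        chrom, start, end, name = line.split("\t")[:4]
        seq = genome[chrom][int(start):int(end)]
        records.append((name, seq))
    return records

for name, seq in extract(genome, bed_lines):
    print(f">{name}\n{seq}")
```

The resulting FASTA would then be indexed with salmon for quantification.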
Cheers, Clemens
@messersc @jgbaum
Hi Jason and Clemens,
I was also thinking of using the tool and came here. I just wondered how you got on with the analysis. Does it seem to work without their specific custom filtering? Any advice and suggestions would be greatly appreciated.
Thank you so much.
Best, Shoko
Hi Jason, Clemens and Shoko,
Glad that there are others trying to run this too. I cloned this directory and decided to try my luck running the commands according to the README page. I managed to get an output from the first step of the paired-end run:
interleaved.pl --read1 ${i}_R1.fastq.gz --read2 ${i}_R2.fastq.gz > ${i}.fastq.gz
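As far as I can tell, that step just alternates the 4-line R1 and R2 records into one stream; a rough Python sketch of the idea (my own hypothetical stand-in, not the actual interleaved.pl, and operating on uncompressed text):

```python
# Rough sketch of FASTQ interleaving: alternate 4-line records from R1 and R2.

def fastq_records(lines):
    """Yield FASTQ records as 4-line chunks."""
    it = iter(lines)
    while True:
        rec = [next(it, None) for _ in range(4)]
        if rec[0] is None:
            return
        yield rec

def interleave(r1_lines, r2_lines):
    """Alternate paired records: R1 record, then its R2 mate, and so on."""
    out = []
    for rec1, rec2 in zip(fastq_records(r1_lines), fastq_records(r2_lines)):
        out.extend(rec1)
        out.extend(rec2)
    return out

r1 = ["@read1/1", "ACGT", "+", "IIII"]
r2 = ["@read1/2", "TTTT", "+", "IIII"]
print("\n".join(interleave(r1, r2)))
```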
So I proceeded to the next step, calling erv_se_genome_v2.pl, but it returned the help info instead. I guess we need to provide all options to run the script? Any ideas on this?
Also, for the --filter option it says to give the path to the parse_bam.pl file, but I do not see it within this directory. Does anybody know where to get this file?
Thank you.
Best, Wan Ching
Hi all,
After a couple of painful days, I made this tool work for our samples. I had to debug and modify the scripts a lot to get there, though. The code has not been written in a generalized way, and it uses older/deprecated versions of the tools.
Note that Tophat2 didn't work in my case: I got a segmentation fault (core dumped) error in the tophat_reports step (as discussed here). So for the second alignment part (gene expression quantification) I switched from Tophat2 to STAR, which is a far better aligner anyway.
Also note that the authors say they use DESeq2 in the methods of the paper, but the script actually uses DESeq, the older version.
@jgbaum you just download the github directory and unzip it to a folder, then add this folder to your PATH to make the executables work; no installation required.
@LimWChing you have to comment out the lines starting with pod2usage(1) unless to avoid seeing the help info, provided you are sure you are giving the right input parameters. Make sure you supply all input parameters even if you start from a mid-stage, and use absolute paths for the files. I see the parse_bam.pl file has now been added to the repo, probably after your message.
Hope these help some people.
Best. Ömer
@bounlu Hi bounlu,
How did you run the script 'run_clean_htseq.pl ./output/cellular c c2 __' successfully? There seems to be a bug in this program at line #30. Thanks.
Best, dong
@rained1989 Hi,
It was a long time ago, so I don't remember, but maybe I can recall if you post the error you get.
Best.
@bounlu Hi bounlu,
Thanks. The author fixed the error by adding a new script 'clean_htseq.pl'.
Best.
@bounlu Hi, I'm having trouble with the TopHat2 alignment. I saw that you replaced this step with STAR. Can you please dig a little bit into this? Which files did you use, and how did you change the main program to use STAR?
Thank you.
@jcarlosangel
Hi,
I did the alignment with STAR myself for each sample with the default parameters. The only output you need is the .bam file. Then I changed the erv_genome.pl script as follows:
my $bam = "Aligned.sortedByCoord.out.bam"; # was "accepted_hits.bam"
This way you can skip the alignment step with Tophat in stage 5, keeping just the indexing:
# tophat
if ($stage <= 5 && $stage2 >= 5) {
# &cmd($aln_cmd);
chdir("GRCh38");
&cmd($bamindex_cmd2);
}
Then you can run it as follows. Note that you still need to supply all input parameters (--tophat, --transcriptome) even if they are not used, so that the script does not complain:
erv_genome.pl --stage 1 --stage2 6 --fastq S1.fastq.gz --btrim btrim64 --tophat tophat2 --bwa bwa --samtools samtools --filter parse_bam.pl --bedtools bedtools --genome hg38.fa --genome_Bowtie2 hg38 --bed ERVmap.chr.bed --genomefile hg38.chrom.sizes --gtf hg38.genes.gtf --transcriptome hg38_tophat --adaptor illumina_adapter.txt --cell S1
@bounlu Thank you so much, I appreciate your help.
@hchintalapudi
Yes, indeed it is the Perl version of the R code normalize_deseq.r. More simply, it's the equivalent of counts(dds, normalized=T) in DESeq2.
@bounlu That's a relief! Like you all, I changed a lot of parameters before reading this thread: I replaced Tophat with STAR, broke the source code up into separate parts, and so on. So I was worried about not following the downstream cleaning scripts. You're saying that dividing the ERV counts by the cellular counts' size factors (the cellular counts generated from STAR alignments and read counting) is accurate, right? And this final ERV matrix can be used for analysis? Thank you.
@hchintalapudi Ohh yes, you are right: you still need to divide the ERV counts by the cellular counts' size factors you got from DESeq2, and that's it for the ERVmap part; please ignore my previous post, which was not clear. Next, what I did was calculate the average fold change between the test and control replicates to define up- and down-regulated ERVs. Then I handed the list to the wet-lab people for validation, back when no corona was around.
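To spell out the arithmetic: DESeq-style size factors are the per-sample medians of the ratios of counts to the per-gene geometric mean, and the ERV counts are then divided by those cellular size factors. A toy Python sketch with made-up numbers (not real data):

```python
import math
from statistics import median

# Toy cellular gene counts: rows = genes, columns = samples.
cellular = [
    [100, 200],
    [ 50, 100],
    [ 10,  20],
]

def size_factors(counts):
    """DESeq-style median-of-ratios size factors, one per sample."""
    ratios_per_sample = [[] for _ in counts[0]]
    for row in counts:
        if any(c == 0 for c in row):
            continue  # genes with a zero count are skipped
        # Geometric mean of this gene across samples.
        gmean = math.exp(sum(math.log(c) for c in row) / len(row))
        for j, c in enumerate(row):
            ratios_per_sample[j].append(c / gmean)
    return [median(r) for r in ratios_per_sample]

# Toy ERV raw counts for the same two samples, divided by the
# cellular size factors to get the normalized ERV matrix.
erv = [[30, 60], [5, 10]]
sf = size_factors(cellular)
normalized = [[c / f for c, f in zip(row, sf)] for row in erv]
```

In the toy data, sample 2 has exactly twice the depth of sample 1, so after division the normalized ERV counts agree between the two samples.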
Thank you @bounlu! Your feedback was helpful. I also did everything in the pre-corona era. :D I applied the DESeq function after supplying these size factors to the dds object and then went on with the differential expression analysis.
Hi @hchintalapudi! Did you normalize while performing the differential expression analysis with DESeq, or did you use the final default ERVmap counts file? I'm asking because when I tried to replicate the results reported in the ERVmap paper there were a lot of inconsistencies! I honestly don't know why, since I didn't change anything in the core script.
Thank you!
Hi @jcarlosangel, I used raw ERV counts and supplied the size factors (from cellular genes) in DESeq2 for normalization. Hope that helps!
@jcarlosangel
Hi,
I did the alignment with STAR myself for each sample with the default parameters. The only output you need is the .bam file. Then I changed the erv_genome.pl script as follows:
my $bam = "Aligned.sortedByCoord.out.bam"; # was "accepted_hits.bam"
This way you can skip alignment step with Tophat in stage5, just let the indexing:
# tophat
if ($stage <= 5 && $stage2 >= 5) {
# &cmd($aln_cmd);
chdir("GRCh38");
&cmd($bamindex_cmd2);
}
Then you can run it as follows. Note that you still need to supply all input parameters (--tophat, --transcriptome) even if they are not used, so that the script does not complain:
erv_genome.pl --stage 1 --stage2 6 --fastq S1.fastq.gz --btrim btrim64 --tophat tophat2 --bwa bwa --samtools samtools --filter parse_bam.pl --bedtools bedtools --genome hg38.fa --genome_Bowtie2 hg38 --bed ERVmap.chr.bed --genomefile hg38.chrom.sizes --gtf hg38.genes.gtf --transcriptome hg38_tophat --adaptor illumina_adapter.txt --cell S1
Quoting this answer after a long time: I just wanted to ask, is there any way I could make tophat run in the script, as I don't have enough RAM to run STAR? Or is there another solution?
Hi,
I run the bioinformatics core at the La Jolla Institute for Immunology (lji.org), and one of our researchers is interested in running your tool. I've had a quick look through the repo and was unable to find any documentation on installing or using the code. Please let me know if you can provide this.
Thanks very much,
Jason