Open sunta3iouxos opened 11 months ago
Just to add: in the BAM file header, the added E. coli chromosome is named Chromosome. Do I need to pass this name to the --spikeinExt option?
Number of Chromosome entries in the BAM files:
for bam in /mnt/c/AP01/bamSpikes/filtered_bam/*bam; do echo $bam; samtools view $bam | grep Chromosome | wc -l; done
/mnt/c/AP01/bamSpikes/filtered_bam/A006200317_201074_S18_L000.filtered.bam
3354
/mnt/c/AP01/bamSpikes/filtered_bam/A006200317_201076_S19_L000.filtered.bam
2853
/mnt/c/AP01/bamSpikes/filtered_bam/A006200317_201078_S20_L000.filtered.bam
3826
/mnt/c/AP01/bamSpikes/filtered_bam/A006200317_201080_S21_L000.filtered.bam
1675
/mnt/c/AP01/bamSpikes/filtered_bam/A006200317_201082_S22_L000.filtered.bam
3535
/mnt/c/AP01/bamSpikes/filtered_bam/A006200317_201084_S23_L000.filtered.bam
1804
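A side note on the counting itself: `grep Chromosome` matches the word anywhere in a SAM record, including the mate reference column, so the numbers above may be slightly inflated. A minimal sketch that counts only records whose reference name (SAM column 3) equals the contig is given below. The function name `count_reads_on_contig` and the three demo records are made up for illustration.

```shell
# Count alignments whose reference name (SAM column 3) equals the given contig.
# Reads SAM records from stdin; the helper name is hypothetical.
count_reads_on_contig() {
    awk -F '\t' -v contig="$1" '$3 == contig { n++ } END { print n + 0 }'
}

# Hypothetical three-record SAM body standing in for the output of samtools view.
# r3 maps to chr2 but has its mate on Chromosome (column 7), so a plain grep would count it.
demo_sam=$(printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tChromosome\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\nr3\t0\tchr2\t7\t60\t4M\tChromosome\t9\t0\tACGT\tIIII\n')

demo_count=$(printf '%s\n' "$demo_sam" | count_reads_on_contig Chromosome)
echo "$demo_count"
```

On real data this would be used as `samtools view "$bam" | count_reads_on_contig Chromosome`.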
Hi Sunta3iouxos,
your workflow in general looks good, i.e. createIndices -> DNA-mapping -> ChIP-seq.
What does
grep -c _spikein CUT-RUNTools-2.0/assemblies/mm10_gencodeM19_spikes/genome_fasta/genome.fa
return?
It looks like the _spikein postfix was not automatically appended by createIndices.
Best wishes,
Katarzyna
0, but
grep -c Chromosome CUT-RUNTools-2.0/assemblies/mm10_gencodeM19_spikes/genome_fasta/genome.fa
1
What does the chromosome name in the bacterial fasta look like? Are there any spaces in it? This typically causes issues.
grep ">" CUT-RUNTools-2.0/assemblies/EB1/Sequence/WholeGenomeFasta/genome.fa
>Chromosome
grep -c " " CUT-RUNTools-2.0/assemblies/EB1/Sequence/WholeGenomeFasta/genome.fa
0
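Since `grep -c " "` scans every line of the FASTA and only looks for literal spaces, a slightly stricter variant restricts the check to header lines and also catches tabs. The sketch below assumes nothing about snakePipes. The helper name `count_headers_with_whitespace` and the two demo files are hypothetical, and the second demo header is a shortened, made-up example of a header with a description.

```shell
# Hypothetical demo files; in practice pass the path of the spike-in genome.fa.
tmpdir=$(mktemp -d)
printf '>Chromosome\nAGCT\n' > "$tmpdir/clean.fa"
printf '>CP053595.1 Escherichia coli chromosome\nAGCT\n' > "$tmpdir/spaced.fa"

# Count FASTA header lines that contain any whitespace (space or tab).
count_headers_with_whitespace() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps scripts running under "set -e".
    grep '^>' "$1" | grep -c '[[:space:]]' || true
}

clean_count=$(count_headers_with_whitespace "$tmpdir/clean.fa")
spaced_count=$(count_headers_with_whitespace "$tmpdir/spaced.fa")
echo "$clean_count $spaced_count"
```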
I added --spikeinExt Chromosome and it does not complain. Is this correct then?
P.S. There are some other warnings, and messages that require new tickets.
You mean you passed this to ChIP-seq? It might actually work.
Still, something went wrong in the createIndices instance. Have a look at what the full chromosome name looks like, and whether there are any characters in there that might bork up the renaming.
With snakePipes 2.7.2, I was able to create a hybrid genome with createIndices as:
createIndices -o $output/GRch38_Ecoli --tools bowtie2 --genomeURL /$genome/genome.fa --gtfURL $genes/genes.gtf --spikeinGenomeURL $genome/Ecoli_C3103_clean.fasta --userYAML GRCh38_g31_Ecoli
The chromosome name in the E. coli fasta was renamed from
>CP053595.1 Escherichia coli strain T7Express_LysYIq chromosome, complete genome
to
>CP053595.1_spikein Escherichia coli strain T7Express_LysYIq chromosome, complete genome
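For illustration, the renaming described above (suffix appended to the sequence name, description left untouched) can be reproduced on a toy file with `sed`. This is only a sketch of the expected result, not the code that createIndices runs, and the demo file and its sequences are made up. Whether a manually renamed FASTA is an acceptable input for the pipeline is not confirmed in this thread.

```shell
# Hypothetical two-record FASTA: one bare header, one header with a description.
tmpdir=$(mktemp -d)
printf '>Chromosome\nAGCTTTTCATTC\n>CP053595.1 Escherichia coli\nACGTACGT\n' > "$tmpdir/spikein.fa"

# Append _spikein to the first whitespace-delimited token of every header line;
# sequence lines and any trailing description are left unchanged.
sed -E '/^>/ s/^>([^[:space:]]+)/>\1_spikein/' "$tmpdir/spikein.fa" > "$tmpdir/spikein_renamed.fa"

renamed_headers=$(grep '^>' "$tmpdir/spikein_renamed.fa")
echo "$renamed_headers"
```

Comparing such a hand-renamed header with the header found in the hybrid genome.fa shows quickly whether the suffix step was applied.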
If you'd like to troubleshoot what happens with the chromosome names in the createIndices workflow, you can pass --snakemakeOptions ' --notemp ', which would keep the temporary intermediate files.
Best wishes,
Katarzyna
> You mean you passed this to ChIP-seq? It might actually work.
Yes. The spike-in chromosome extension can be specified with --spikeinExt.
> Still, something went wrong in the createIndices instance. Have a look what the full chromosome name looks like, whether there are any characters in there that might bork up the renaming
If you follow the information provided, the _spikein identifier is there, but only in spikein_genes.gtf and nowhere else (at least not in the readable files, i.e. excluding the binary bowtie2 index files).
The genome.fa of the bacteria is as follows:
head CUT-RUNTools-2.0/assemblies/EB1/Sequence/WholeGenomeFasta/genome.fa
>Chromosome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
ACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT
AACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG
CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGT
ACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTG
GCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAA
I do not see any specific error. The generated genome.fa in the spiked-in folder looks like:
grep ">" CUT-RUNTools-2.0/assemblies/mm10_gencodeM19_spikes/genome_fasta/genome.fa
>chr1 1
>chr2 2
>chr3 3
>chr4 4
>chr5 5
>chr6 6
>chr7 7
>chr8 8
>chr9 9
>chr10 10
>chr11 11
>chr12 12
>chr13 13
>chr14 14
>chr15 15
>chr16 16
>chr17 17
>chr18 18
>chr19 19
>chrX X
>chrY Y
>chrM MT
>GL456210.1 GL456210.1
>GL456211.1 GL456211.1
>GL456212.1 GL456212.1
>GL456213.1 GL456213.1
>GL456216.1 GL456216.1
>GL456219.1 GL456219.1
>GL456221.1 GL456221.1
>GL456233.1 GL456233.1
>GL456239.1 GL456239.1
>GL456350.1 GL456350.1
>GL456354.1 GL456354.1
>GL456359.1 GL456359.1
>GL456360.1 GL456360.1
>GL456366.1 GL456366.1
>GL456367.1 GL456367.1
>GL456368.1 GL456368.1
>GL456370.1 GL456370.1
>GL456372.1 GL456372.1
>GL456378.1 GL456378.1
>GL456379.1 GL456379.1
>GL456381.1 GL456381.1
>GL456382.1 GL456382.1
>GL456383.1 GL456383.1
>GL456385.1 GL456385.1
>GL456387.1 GL456387.1
>GL456389.1 GL456389.1
>GL456390.1 GL456390.1
>GL456392.1 GL456392.1
>GL456393.1 GL456393.1
>GL456394.1 GL456394.1
>GL456396.1 GL456396.1
>JH584292.1 JH584292.1
>JH584293.1 JH584293.1
>JH584294.1 JH584294.1
>JH584295.1 JH584295.1
>JH584296.1 JH584296.1
>JH584297.1 JH584297.1
>JH584298.1 JH584298.1
>JH584299.1 JH584299.1
>JH584300.1 JH584300.1
>JH584301.1 JH584301.1
>JH584302.1 JH584302.1
>JH584303.1 JH584303.1
>JH584304.1 JH584304.1
>Chromosome
In the genome.fa.fai the correct size is reported:
grep Chromosome CUT-RUNTools-2.0/assemblies/mm10_gencodeM19_spikes/genome_fasta/genome.fa.fai
Chromosome 4686137 2776387558 60 61
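The .fai index is also a convenient place to check for the suffix, since its first column holds the sequence names. The sketch below counts names ending in a given extension. The helper name `count_spikein_contigs` is hypothetical, the first demo line is the .fai entry shown above, and the second is a made-up "renamed" variant used only for contrast.

```shell
# Count sequence names (column 1 of a .fai file) that end with the given extension.
count_spikein_contigs() {
    awk -F '\t' -v ext="$1" '
        length($1) >= length(ext) && substr($1, length($1) - length(ext) + 1) == ext { n++ }
        END { print n + 0 }
    ' "$2"
}

tmpdir=$(mktemp -d)
# Entry as reported in this thread (suffix missing).
printf 'Chromosome\t4686137\t2776387558\t60\t61\n' > "$tmpdir/current.fai"
# Hypothetical entry showing how a renamed contig would appear.
printf 'Chromosome_spikein\t4686137\t2776387558\t60\t61\n' > "$tmpdir/renamed.fai"

current_hits=$(count_spikein_contigs _spikein "$tmpdir/current.fai")
renamed_hits=$(count_spikein_contigs _spikein "$tmpdir/renamed.fai")
echo "$current_hits $renamed_hits"
```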
INTERESTINGLY
grep Chromosome /home/tgeorgom/CUT-RUNTools-2.0/assemblies/mm10_gencodeM19_spikes/annotation/spikein_genes.gtf | head -n 1
**Chromosome_spikein** ena CDS 190 252 . + 0 exon_number "1"; gene_biotype "protein_coding"; gene_id "ECDH10B_0001"; gene_name "thrL"; gene_source "ena"; gene_version "1"; p_id "P2524"; protein_id "ACB01206"; protein_version "1"; transcript_biotype "protein_coding"; transcript_id "ACB01206"; transcript_name "thrL-1"; transcript_source "ena"; transcript_version "1"; tss_id "TSS3237";
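The mismatch described here (suffix present in the GTF, absent from the FASTA) can be checked mechanically by comparing the two sets of sequence names. A minimal sketch with toy files follows; the file contents are abbreviated stand-ins for the real genome.fa and spikein_genes.gtf, and the variable name `missing_in_fasta` is made up.

```shell
tmpdir=$(mktemp -d)
# Toy stand-ins for the hybrid genome.fa and spikein_genes.gtf.
printf '>chr1 1\nACGT\n>Chromosome\nAGCT\n' > "$tmpdir/genome.fa"
printf 'Chromosome_spikein\tena\tCDS\t190\t252\t.\t+\t0\tgene_id "ECDH10B_0001";\n' > "$tmpdir/spikein_genes.gtf"

# Sequence names in the FASTA: first token after ">".
sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$tmpdir/genome.fa" | sort -u > "$tmpdir/fasta_names.txt"
# Sequence names in the GTF: column 1, skipping comment lines.
awk -F '\t' '!/^#/ { print $1 }' "$tmpdir/spikein_genes.gtf" | sort -u > "$tmpdir/gtf_names.txt"

# Names used in the GTF that have no matching FASTA record.
missing_in_fasta=$(comm -23 "$tmpdir/gtf_names.txt" "$tmpdir/fasta_names.txt")
echo "$missing_in_fasta"
```

An empty result would mean every annotated sequence name exists in the FASTA; here the GTF name is reported because the FASTA still carries the unsuffixed name.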
I can check if I can reproduce this issue here.
Hi there, any updates on this one? Did you have time to check this?
Hi,
could you send me the link to this bacterial genome fasta?
Best wishes,
Katarzyna
Hi Katarzyna and everyone else, I cannot run ChIP-seq with the spike-in settings. This is a CUT&RUN protocol with bacterial carry-over that can be used to "normalize the samples". This is the full error:
The steps I have followed:
And this is my hybrid genome YAML: mambaforge/envs/snakePipes/lib/python3.11/site-packages/snakePipes/shared/organisms/mm10_gencodeM19_spikes.yaml
indexed genome: