Duplicate genes - Githubissues

ppgardne commented 7 years ago

fastq.stats.txt CP011972-tiled.out.CP011972.tradis_gene_insert_sites.txt

Hi Folks,

I hope all is well in Sanger-land.

We've been working with TraDIS data in Pseudomonas syringae. Especially this beast:
https://www.ebi.ac.uk/ena/data/view/CP011972

Two of the genes stood out to my collaborators HopAM1-1 (IYO_023205) and HopAM1-2 (IYO_008385). These are 100% identical copies of a 830 nucleotide ORF. In our TraDIS experiments no insertions were found in either gene, but they aren't thought to be essential. I ran a little test to see what happened to insertions in these genes. I simulated a fastq file that tiled 120mers from each position across the gene. I then ran the "bacteria_tradis" and "tradis_gene_insert_sites" pipelines on these. The genes both come back as having 0 insertions and "ins_index" values of zero.

Stats:
File                                hopam.fastq
Total Reads             1424
Reads Matched               1424
% Matched               100
Reads Mapped                1424
% Mapped                100
Unique Insertion Sites : CP011972   0
Seq Len/UIS : CP011972          NaN
Total Unique Insertion Sites        0
Total Seq Len/Total UIS         NaN

100% of the reads map to the genome, it looks like SMALT does sensible things here. The failure appears to be in the way Bio-Tradis counts insertion sites. Is it ignoring multi-mapped reads, or perhaps the parser doesn't recognise them.

I've since scaled this test up to tile reads across the entire genome, then identify the which genes/regions can actually be probed by your pipeline. Please excuse my awful code, but it gets the job done.

Generate a fastq that tiles 120mer reads across the entire genome (forward and reverse). Warning: generates a 5G file.

cat CP011972.2.fasta      | grep -v '^>' | tr -d "\n" | tr "a-z" "A-Z" | perl -lane '$l=120; $sq=$_; for($i=0; $i<(length($_)-$l+1); $i++){$s = substr($sq, $i, $l); $cnt++; printf "\@M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\nGACAGTCGAC$s\n+M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\n>1AA1111BCDBB3BBDFGB3BBB01BD3D1AAB1B1BBD2DABDA1B1//AAAB//011DDGDD2A2BDB2DDDDGHDA>//BD2D1/////11/B?>B2BB1B/>//?1BB?0/B1BBGBBBFGHH?B\n", 1000+$cnt, 1000+$cnt;}' > CP011972-tiled.fastq
revcomp CP011972.2.fasta  | grep -v '^>' | tr -d "\n" | tr "a-z" "A-Z" | perl -lane '$l=120; $sq=$_; for($i=0; $i<(length($_)-$l+1); $i++){$s = substr($sq, $i, $l); $cnt++; printf "\@M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\nGACAGTCGAC$s\n+M00100:233:000000000-BFM3J:1:1102:15255:%d 1:N:0:GTGAAA\n>1AA1111BCDBB3BBDFGB3BBB01BD3D1AAB1B1BBD2DABDA1B1//AAAB//011DDGDD2A2BDB2DDDDGHDA>//BD2D1/////11/B?>B2BB1B/>//?1BB?0/B1BBGBBBFGHH?B\n", 7*(10**6)+$cnt, 7*(10**6)+$cnt;}' >> CP011972-tiled.fastq

echo "CP011972-tiled.fastq" > fastq.infiles

Run "bacteria_tradis":

bacteria_tradis -f fastq.infiles -t 'GACAGTCGAC' -r CP011972.2.fasta  --smalt_s 1 --smalt_k 10 --smalt_r 0 --smalt_y 0.8 -mm 5 -v

*Make the stats file readable:

grep ^File fastq.stats | tr "," "\n" > blah && grep -v ^File fastq.stats | tr "," "\n" > blah2 && paste blah blah2

File                                  CP011972-tiled.fastq
Total Reads                13110900
Reads Matched                  13110900
% Matched                  100
Reads Mapped                   13110900
% Mapped                   100
Unique Insertion Sites : CP011972      6101020 (length: 6555569)
Seq Len/UIS : CP011972             1.0745037715005
Total Unique Insertion Sites           6101020
Total Seq Len/Total UIS            1.

0745037715005

Call essential genes:

tradis_gene_insert_sites  -o tradis_gene_insert_sites CP011972.2.embl CP011972-tiled.out.CP011972.insert_site_plot

The CP011972-tiled.out.CP011972.tradis_gene_insert_sites file (attached) also revealed some interesting results.

*None of the "ins_index" values were 1.0 and "gene_length" and "ins_count" almost always differ by 1 i.e. gene_length = ins_count-1.

*335 genes had an "ins_index" less than 0.1. I think all the values should be ~1.0 in this test. There are 5883 genes in total, so roughly 6% all the PSA genes are not able to be studied with your pipeline.

It'd be great if someone could take a look at this, otherwise I'll need to find a workaround.

Best wishes, Paul.

lbarquist commented 7 years ago

Hi Paul,

Have you tried using the "-m 0" option in bacteria_tradis? This should use reads with a mapping quality of 0 in your analysis (i.e. multi-mapping reads), and is discussed in the tutorial PDF.

(edit: was wrong about the -e, our current terminology is correct)

ppgardne commented 7 years ago

There's a tutorial PDF?! ;-)

That made a big difference. Shouldn't this be the default mode?

File                               CP011972-tiled.fastq
Total Reads             13110900
Reads Matched               13110900
% Matched               100
Reads Mapped                13110900
% Mapped                100
Unique Insertion Sites : CP011972   6504510
Seq Len/UIS : CP011972          1.00784978422664
Total Unique Insertion Sites        6504510 (length: 6555569)
Total Seq Len/Total UIS         1.00784978422664

All genes now have an ins_index > 0.83. The off by one thing may still be a very minor issue.

Many thanks for your help! P.

lbarquist commented 7 years ago

Just to follow up on this briefly now that I've had some coffee:

Our default parameters are for comparative experiments, i.e. between two conditions. In this case we don't consider multi-mapping reads -- since we can't uniquely identify where these reads originate, it should be generally more conservative not to consider them. For example, if we did we might falsely call all repeats of a particular gene as required when only a single copy is actually under selective pressure. In this case, your analysis would be correct, and there's approximately 6% we can't say much about -- this problem would probably be shared by any short-read sequencing technology (e.g. RNA-seq). You could potentially get some more info in these cases by looking at orthology groups, but this isn't something we've implemented.

It sounds like you're trying to do an essentiality study. In this case we have a special flag (-e), which sets --smalt_r 0 -m 0, so bacteria_tradis uses multimapping reads. In this case using everything is most conservative in terms of calling essential genes - basically all identical repeated proteins will be classified as non-essential as long some of them are non-essential. I suppose there might be catastrophic cases where you have 100 copies of a gene, and only one is non-essential without sufficient insertion density to "flip" the others, but I doubt this would happen very often in reality.

If you're still observing this behavior after setting --smalt_r 0 -m 0, then we have a problem. Hope this helps.

lbarquist commented 7 years ago

Cool, glad it's working now! I'll take a look at the off by one thing in the insertion index calculation when I've got some time -- I don't think it should have much of an effect on downstream results, but obviously not ideal.

sanger-pathogens / Bio-Tradis

Duplicate genes #81