maximilianh / crisporWebsite

All source code of the crispor.org website
http://crispor.org

Possibility to keep off-targets list of gRNA in repeats ? #42

Closed pebonte closed 1 year ago

pebonte commented 4 years ago

Hi,

We are currently working on identifying gRNA sequences targeting sub-families of transposable elements. The purpose is to test "consensus" sequences that we identified to show that these sequences are globally shared among all the transposable elements from these sub-families.

So for different sub-families, it works really well. We create a query with the "consensus" sequence that we want to test, and among the off-targets, by comparing the localisations, we know for each off-target exactly which transposable elements or genes the off-target comes from.
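
The localisation comparison described above could look roughly like the following. This is only a minimal sketch, not our actual script: the coordinates are made up, the tuple layout is an assumption, and the function names `build_index` and `annotate` are hypothetical. For genome-scale data an interval tree or a dedicated tool would be faster.

```python
from bisect import bisect_right
from collections import Counter


def build_index(annotations):
    """Group (chrom, start, end, name) records per chromosome, sorted by start."""
    index = {}
    for chrom, start, end, name in annotations:
        index.setdefault(chrom, []).append((start, end, name))
    for records in index.values():
        records.sort()
    return index


def annotate(index, chrom, start, end):
    """Return the names of all annotated elements overlapping [start, end)."""
    records = index.get(chrom, [])
    # Records whose start is >= `end` cannot overlap, so cut the list there.
    hi = bisect_right(records, (end,))
    return [name for rec_start, rec_end, name in records[:hi] if rec_end > start]


# Illustrative annotation records (coordinates invented for this example).
annotations = [
    ("chr1", 100, 200, "L1Md_A"),
    ("chr1", 500, 700, "L1Md_T"),
    ("chr2", 50, 90, "B2_Mm1a"),
]
index = build_index(annotations)

# Illustrative off-target locations as (chrom, start, end).
off_targets = [("chr1", 150, 173), ("chr1", 690, 713), ("chr1", 300, 323)]

counts = Counter()
for chrom, start, end in off_targets:
    names = annotate(index, chrom, start, end) or ["none"]
    counts.update(names)

for name, n in counts.most_common():
    print(name, n)
```

Dividing the count for the expected sub-family by the total number of off-targets then gives the share of hits that land in the right elements.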

However, for some sub-families with a higher number of transposable elements (more than 60 000), neither the website nor the command-line tool allows me to retrieve the tabular list of off-targets.

I tried to change some parameters in the crispor.py in order to not have a filter on Repeats but I still can't retrieve the off-targets data (whereas the gRNA is present in the guideRNA tabular file).

Do you have any idea how I could retrieve the off-targets for gRNAs with many off-targets? (I also "played" with --maxOcc and --minAltPamScore but still can't retrieve them.)

Thank you for your time and thanks again for this tool.

Best,

Pierre-Emmanuel

maximilianh commented 4 years ago

Alignments against repeats are a difficult topic. They can clog up any pipeline. There are a few global variables to filter them, and some filtering even within BWA itself. There are cases ("isRep") where BWA will mark things as repetitive but won't return all alignments; I never figured out why.

maxOcc is a good start. You can also try to increase MFAC in crispor.py, or play with the -m parameter when running BWA. You can also align your guide with BWA alone to see whether the alignments are already filtered out by BWA itself, in which case there is nothing that CRISPOR can do about them.
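
To make the "BWA alone" test concrete, here is a rough sketch that only builds and prints the two command lines; it does not run anything. The options used (`bwa aln` with `-n`, `-k`, `-l`, `-o`, `-m`, `-N` and `bwa samse` with `-n`) are standard BWA options, but the values, the file names, and whether crispor.py calls BWA with exactly these settings are assumptions, so compare against the script before relying on them.

```python
import shlex


def bwa_commands(genome_index, guides_fa, max_diff=4, seed_len=20,
                 queue_size=2000000, max_occ=60000):
    """Build `bwa aln` and `bwa samse` argument lists for a guide FASTA file.

    All default values are illustrative assumptions, not CRISPOR's settings.
    """
    sai = guides_fa + ".sai"
    aln = [
        "bwa", "aln",
        "-o", "0",               # no gap opens
        "-m", str(queue_size),   # maximum entries in the search queue
        "-n", str(max_diff),     # maximum differences over the whole guide
        "-k", str(max_diff),     # maximum differences in the seed
        "-N",                    # non-iterative mode: search all hits
        "-l", str(seed_len),     # seed length
        genome_index, guides_fa,
    ]
    samse = [
        "bwa", "samse",
        "-n", str(max_occ),      # maximum number of alignments listed in XA
        genome_index, sai, guides_fa,
    ]
    return aln, sai, samse


# Hypothetical file names, for illustration only.
aln, sai, samse = bwa_commands("mm10.fa", "guides.fa")
print(shlex.join(aln), ">", sai)
print(shlex.join(samse), ">", "guides.sam")
```

If the resulting SAM file already lacks the expected alignments, the filtering happens inside BWA and not in crispor.py.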

Other aligners do less filtering. You can try FlashFry: it specializes in bulk-library design, and it probably also does a lot less filtering of repeats on the off-target side, as it doesn't have to keep a web server running.

Let me know if any of these work, I have 2-3 more ideas on aligners that are not limited by repeats.

Also, I'm curious: if you actually use these guides, won't they shred your genome into pieces?


pebonte commented 4 years ago

Thanks for all of these suggestions ! I'll try them out and let you know if it works.

As for your question, we don't want to use the gRNA to cut, but to target these elements and repress their transcription.

Thanks to CRISPOR we saw that in most cases 80-85% of off-targets are in intergenic regions, but 15-20% remain intronic, so we know that it is a bit risky.

Also by retrieving the off-targets, we saw that 70% of them are elements from the correct sub-families, the other 30% are off-targets from elements in related sub-families.

So the approach is not perfect but we are trying to improve it.

Thanks again !

pebonte commented 4 years ago

Hi again.

I managed to make it work thanks to your suggestion. In the end, my command line was:
./crispor.py mm10 seq.consensus.fa testOutmm10.mine.tab -o testOutmm10Offtargets.mine.tab --maxOcc 500000 --minAltPamScore 0

I have exactly the same results as I have with the website but now I also have the off-targets list for the guides marked as guides targeting Repeats.

I also changed some things in the crispor.py script. I don't know exactly which modifications made it work, but if someone is interested I can link my modified crispor.py.

Thanks again for your help.

maximilianh commented 4 years ago

Great! yes, I'd be curious to keep your parameters somewhere, in case someone else asks again. Could you post your crispor.py here or send it to me by email? (max@soe.ucsc.edu)

I'm still curious: if you actually use these guides, won't they shred your genome into pieces?


pebonte commented 4 years ago

Hopefully it will just repress and not shred the genome. They are using a different type of Cas9 that should only repress the transcription and not cut the targeted regions. If it works I'll make sure to update you.

And sure, I attach the script below (I changed the extension to .txt because attaching it as .py doesn't work). crispor.txt https://github.com/maximilianh/crisporWebsite/files/4165814/crispor.txt

maximilianh commented 4 years ago

From what I can see, you set MAXOCC to 1000000 and commented out the line where isRep is set.

I have a bad feeling about the latter: isRep means that there is a mismatch between the number of alignments BWA reports and the number of locations it actually returns. I believe this indicates that we didn't get all off-targets from BWA, because BWA gave up chasing alignments to save time. So if you switched off the detection of this suppression, you will have some off-target alignments, but not all of them.
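
To illustrate the kind of mismatch I mean, here is a small sketch that compares the hit counts BWA reports in a SAM record with the locations it actually lists. The tags `X0` (number of best hits), `X1` (number of suboptimal hits) and `XA` (alternative hits) are written by `bwa samse`; whether crispor.py computes isRep exactly this way is an assumption, and the SAM lines are made up.

```python
def reported_vs_listed(sam_line):
    """Return (hits BWA counted, locations BWA listed) for one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])

    # Optional tags start at column 12 and look like NAME:TYPE:VALUE.
    tags = {}
    for field in fields[11:]:
        name, _type, value = field.split(":", 2)
        tags[name] = value

    counted = int(tags.get("X0", 0)) + int(tags.get("X1", 0))
    if flag & 4:  # unmapped read: no location at all
        return counted, 0
    alternatives = [hit for hit in tags.get("XA", "").split(";") if hit]
    listed = 1 + len(alternatives)  # primary alignment plus XA entries
    return counted, listed


# Made-up record: BWA counted 7 hits but lists only 3 locations.
truncated = "\t".join([
    "guide1", "0", "chr1", "1000", "0", "20M", "*", "0", "0",
    "ACGTACGTACGTACGTACGT", "*",
    "XT:A:R", "NM:i:0", "X0:i:5", "X1:i:2",
    "XA:Z:chr2,+500,20M,0;chr3,-900,20M,1;",
])
# Made-up record: a unique hit, fully listed.
unique = "\t".join([
    "guide2", "0", "chr1", "2000", "37", "20M", "*", "0", "0",
    "ACGTACGTACGTACGTACGT", "*",
    "XT:A:U", "NM:i:0", "X0:i:1", "X1:i:0",
])

for line in (truncated, unique):
    counted, listed = reported_vs_listed(line)
    status = "incomplete" if counted > listed else "complete"
    print(line.split("\t")[0], counted, listed, status)
```

A guide flagged as incomplete here is one where the off-target table cannot contain every location, whatever filters are disabled downstream.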

pebonte commented 4 years ago

So I checked and indeed we are missing some off-targets.

For example, for one gRNA, the website detected 106 610 off-targets whereas in my tabular file, I have 100 085. I checked some others and the proportions seem similar.

I think for now we will not look more deeply into this issue, as retrieving the data for more than 90% of the off-targets should be enough to determine whether this gRNA targets mostly transposable elements from the same sub-family.
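
For reference, the arithmetic behind the "more than 90%" figure for the example gRNA above, using only the two counts quoted in this comment:

```python
website_count = 106610   # off-targets reported by the website
tabular_count = 100085   # off-targets present in the tabular file

missing = website_count - tabular_count
recovered = tabular_count / website_count

print(f"missing: {missing}, recovered: {recovered:.1%}")
# prints "missing: 6525, recovered: 93.9%"
```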

Thanks again for your help.

maximilianh commented 4 years ago

I also wanted to note that for an entirely different project, I've built consensuses for all repeat types and alignments with all copies. This would be another way to design guides.


pebonte commented 4 years ago

That would be very interesting indeed.

We used this project (https://github.com/mengzhou/MonomerAnnotation) to obtain our consensuses. For some consensuses, more than 80% of the gRNA targets are elements from the right family, but for others, the gRNAs mostly target elements from the right sub-family and also elements from many closely related sub-families.

When you say repeat types, do you mean at the class level (like LTR, LINE, SINE, DNA), at the family level (L1, ERV1, ERVL, etc.), or at the sub-family level (L1_Mus1, RLTR6-int, B2, etc.)?

maximilianh commented 4 years ago

Sorry, I've never heard of MonomerAnnotation, no idea what it does. DFAM (used by RepeatMasker) has consensuses already, so I'm not sure why one would use separate software...?

With type, I mean at the RepeatMasker annotation level, what RepeatMasker calls "repName".

pebonte commented 4 years ago

We used this software from this paper (https://link.springer.com/article/10.1186/s13100-019-0156-5) because we wanted to target the shared promoter region of elements from several specific sub-families of L1Md.

maximilianh commented 4 years ago

Oh, this is very class-specific. OK, never mind then. Looks like you have a very specific application here.

maximilianh commented 1 year ago

Before closing this thread, I want to mention repeatbrowser.ucsc.edu, which builds consensus sequences and maps from genome to consensus for you, for hg19 and hg38. Let me know if this could be useful for you one day.