Closed nli8888 closed 2 years ago
thanks. I will download the data and try to debug it on my side.
Hi Nick,
Before running the workflow, I found something in your data that is probably an issue. I noticed that your designed sequences are 270 bp long, while your FW and REV reads have lengths of 132 and 121 bp, so you do not cover the whole insert. Theoretically this is OK when your sequences differ from each other at the ends. When sequences are identical at the ends and differ only in the middle, e.g. when testing variants, you will not be able to align your data correctly.
In practice it is an issue because of how MPRAflow works: for PE reads we try to merge both reads before mapping. See here: https://github.com/shendurelab/MPRAflow/blob/fb359522be58bf1ddafe45e313b98d582a43bf99/association.nf#L331 This will not be possible with your data.
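The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration using the lengths quoted above; the function name `paired_read_coverage` is hypothetical and not part of MPRAflow.

```python
def paired_read_coverage(insert_len, fw_len, rev_len):
    """Return (overlap, gap) in bp for a read pair sequenced from both ends of an insert."""
    covered = fw_len + rev_len
    overlap = max(0, covered - insert_len)  # bases seen by both reads
    gap = max(0, insert_len - covered)      # bases in the middle seen by neither read
    return overlap, gap


# Lengths taken from this thread: 270 bp design, 132 bp FW read, 121 bp REV read.
overlap, gap = paired_read_coverage(insert_len=270, fw_len=132, rev_len=121)
print(f"overlap: {overlap} bp, uncovered gap: {gap} bp")
```

With these numbers the overlap is 0 bp and 17 bp in the middle stay uncovered, so an overlap-based merge has nothing to work with.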
A workaround would be to use the SE option, discarding the reverse read. But then your designs have to differ within the first 132 bp. The CIGAR string also has to be adjusted, e.g. to 132M.
It would be the same issue with MPRAsnakeflow. But I think it is possible to use proper PE mapping without read merging.
I am not sure if this triggers the AttributeError issue.
Best, Max
Thanks for the reply. I suppose we'll try out the SE mode first in the meantime.
What do you mean exactly when you say "design has to be different in the first 132 bp"?
Your read must map uniquely to one sequence. If this is not the case, the BC will be discarded because of ambiguity. The 132 refers to the length of your first read, which starts sequencing your design from the 5' end.
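One way to check this before running SE mode is to group the designed sequences by their first 132 bases and look for collisions. The snippet below is a minimal sketch, not MPRAflow code: `ambiguous_prefixes` is a hypothetical helper, and the toy sequences and short read lengths are made up for illustration. For real data you would load the design FASTA into a name-to-sequence dict and pass `read_len=132`.

```python
from collections import defaultdict


def ambiguous_prefixes(design, read_len):
    """Return groups of design names whose first read_len bases are identical."""
    groups = defaultdict(list)
    for name, seq in design.items():
        groups[seq[:read_len].upper()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Toy design (assumed data): two sequences differ only near the 3' end.
design = {
    "ref": "ACGTACGTAA",
    "variant_late": "ACGTACGTCA",   # differs from ref at position 9 only
    "variant_early": "ACCTACGTAA",  # differs from ref at position 3
}

print(ambiguous_prefixes(design, read_len=8))   # ref and variant_late collide
print(ambiguous_prefixes(design, read_len=10))  # no collisions
```

Any group reported here contains sequences that a forward read of that length cannot tell apart, so barcodes attached to them would be expected to be discarded as ambiguous.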
btw. the error is due to zero reads after merging
Ah ok I understand. Sorry I thought you meant something else.
Ok so we have gotten SE mode running and we'll see how that goes. If we need to run PE mode, we may try to manually trim the missing middle section in our design file. For future seqs, we'll just make them shorter to ensure full coverage.
Alternatively, we may try MPRAsnakeflow if it is indeed possible to use PE mapping without merging?
But we'll see which way provides the best results for our purposes.
Right now MPRAsnakeflow merges reads, too. If more people generate such data, I think it is worth implementing. And I am always happy if someone implements it themselves and makes a pull request to the repo.
> Ok so we have gotten SE mode running and we'll see how that goes. If we need to run PE mode, we may try to manually trim the missing middle section in our design file. For future seqs, we'll just make them shorter to ensure full coverage.
Manually trimming the middle section of the design file will not help you, because the PE reads still do not overlap. Another option is merging the reads yourself, inserting Ns in between. After that you can run the single-end mode. Not 100% sure if it works, but maybe this is the best option!
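A rough sketch of what such a manual merge could look like for a single read pair is shown below. It is untested against MPRAflow and rests on several assumptions: every read has the full expected length, the reverse read has to be reverse-complemented to match the design orientation, and the padded Ns get the lowest Phred+33 quality (`!`). The helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_with_ns(fw_seq, fw_qual, rev_seq, rev_qual, insert_len, pad_qual="!"):
    """Join a non-overlapping read pair into one read, padding the gap with Ns."""
    gap = insert_len - len(fw_seq) - len(rev_seq)
    if gap < 0:
        raise ValueError("reads overlap; use a proper read merger instead")
    seq = fw_seq + "N" * gap + reverse_complement(rev_seq)
    # Qualities of the reverse read are reversed to follow the new orientation.
    qual = fw_qual + pad_qual * gap + rev_qual[::-1]
    return seq, qual


# Toy pair (assumed data): 4 bp reads from both ends of a 10 bp insert.
seq, qual = join_with_ns("ACGT", "IIII", "TTGA", "FFFF", insert_len=10)
print(seq)   # ACGTNNTCAA
print(qual)  # IIII!!FFFF
```

For the data in this thread the gap would be 270 - 132 - 121 = 17 Ns per pair. Whether the mapper and the CIGAR filter in SE mode accept such padded reads would still need to be checked.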
Ok thanks. We have some outputs now though there's some other issues that we have with them. I'll close this issue and repost another to keep things organised.
Hi,
We are trying to run association.nf with some of our own sequences, but run into issues during the map_element_barcodes part. The pipeline works with the example sequences from https://mpraflow.readthedocs.io/en/latest/association_example1.html so we think there is something wrong with ours. The error that is returned is:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
From looking at previously raised issues, this error has occurred before. However, judging by the posts, it seems to be data-specific.
The test data that we tried using are attached. We tried to match our data format as closely as possible to the example. The command to run was:
500sequences.fa.gz test_1.fastq.gz test_2.fastq.gz test_3.fastq.gz