shendurelab / MPRAflow

A portable, flexible, parallelized tool for complete processing of massively parallel reporter assay data
Apache License 2.0

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas #68

Closed nli8888 closed 2 years ago

nli8888 commented 2 years ago

Hi,

We are trying to run association.nf with some of our own sequences, but run into issues during the map_element_barcodes part. The pipeline works with the example sequences from https://mpraflow.readthedocs.io/en/latest/association_example1.html so we think there is something wrong with ours.

The error that is returned is AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

From looking at previously raised issues, this error has occurred before. However, judging by the posts, it seems to be data specific.

The test data that we tried using are attached. We tried to match our data format as closely as possible to the example. The command to run was:

nextflow run association.nf -w /home/ubuntu/Assoc_Basic/work --fastq-insert "/home/ubuntu/Assoc_Basic/data/test_1.fastq.gz" --fastq-insertPE "/home/ubuntu/Assoc_Basic/data/test_3.fastq.gz" --fastq-bc "/home/ubuntu/Assoc_Basic/data/test_2.fastq.gz" --design "/home/ubuntu/Assoc_Basic/data/500sequences.fa" --name assoc_basic --outdir /home/ubuntu/Assoc_Basic/output

500sequences.fa.gz test_1.fastq.gz test_2.fastq.gz test_3.fastq.gz

visze commented 2 years ago

Thanks. I will download the data and try to debug it on my side.

visze commented 2 years ago

Hi Nick,

Before running the workflow I found something in your data that is probably an issue. I noticed that your designed sequences are 270 bp long, while your FW and REV reads are 132 and 121 bp long, so you do not cover the whole insert. Theoretically this is OK when your sequences differ from each other at the ends. When sequences are identical at the ends and differ only in the middle, e.g. when testing variants, you will not be able to align your data correctly.
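The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration, not part of MPRAflow: the function name is hypothetical, it assumes both reads start exactly at the two ends of the insert, and it treats any overlap above zero as mergeable, whereas a real read merger needs some minimum overlap.

```python
def paired_end_coverage(insert_len: int, fwd_len: int, rev_len: int) -> dict:
    """Summarize how a read pair covers an insert sequenced from both ends.

    Hypothetical helper: assumes the forward read starts at the 5' end of the
    insert and the reverse read starts at the 3' end.
    """
    total = fwd_len + rev_len
    overlap = max(0, total - insert_len)  # bases seen by both reads
    gap = max(0, insert_len - total)      # bases in the middle seen by neither read
    return {
        "covered_bp": min(insert_len, total),
        "overlap_bp": overlap,
        "gap_bp": gap,
        "mergeable": overlap > 0,         # simplification, see note above
    }


# Numbers from this thread: 270 bp designs, 132 bp FW read, 121 bp REV read.
result = paired_end_coverage(insert_len=270, fwd_len=132, rev_len=121)
print(result)
```

With the lengths from this thread, the two reads leave 17 bp in the middle of each design unsequenced and do not overlap at all.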

In practice it is an issue because of how MPRAflow works: for PE reads we try to merge both reads before mapping. See here: https://github.com/shendurelab/MPRAflow/blob/fb359522be58bf1ddafe45e313b98d582a43bf99/association.nf#L331 This will not be possible with your data.

A workaround is to use the SE option and discard the reverse read. But then the sequences in your design have to differ within the first 132 bp. The CIGAR string also has to be adjusted, e.g. to 132M.

It would be the same issue with MPRAsnakeflow. But I think it is possible to use proper PE mapping without read merging.

I am not sure if this triggers the AttributeError issue.

Best, Max

nli8888 commented 2 years ago

Thanks for the reply. I suppose we'll try out the SE mode first in the meantime.

What do you mean exactly when you say "design has to be different in the first 132 bp"?

visze commented 2 years ago

Your read must map uniquely to one sequence. If this is not the case, the BC will be discarded because of ambiguity. The 132 refers to the length of your first read, which is sequenced starting from the 5' end of the sequences in your design file.
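One way to check a design for this before running SE mode is to group the designed sequences by their first N bases and report any group with more than one member. The sketch below is a standalone illustration, not MPRAflow code: the function names and the toy FASTA records are made up, and it only detects exact prefix duplicates, not near-identical prefixes that an aligner might also treat as ambiguous.

```python
from collections import defaultdict


def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dict (sequences upper-cased)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def ambiguous_prefixes(records, read_len):
    """Return groups of sequence names that share the same first read_len bases."""
    groups = defaultdict(list)
    for name, seq in records.items():
        groups[seq[:read_len]].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy design (hypothetical): two variants of a 10 bp reference.
example = """>ref
ACGTACGTAA
>variant_late
ACGTACGTTA
>variant_early
ACCTACGTAA
"""

records = parse_fasta(example)
clashes = ambiguous_prefixes(records, read_len=8)
print(clashes)  # [['ref', 'variant_late']]
```

For the data in this thread, the same check would be run on the design file with read_len=132; any group it reports contains sequences that a 132 bp forward read alone cannot tell apart.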

visze commented 2 years ago

Btw, the error is due to zero reads being left after merging.
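Since an empty intermediate file surfaces only later as a confusing pandas error, it can help to count reads explicitly when debugging. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, it expects a gzipped FASTQ with exactly four lines per record, and it is not something the workflow does itself. If the intermediate file is in another format (for example BAM), a different tool would be needed.

```python
import gzip
import os
import tempfile


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# Demonstration on throwaway files: one empty, one with a single record.
with tempfile.TemporaryDirectory() as tmp:
    empty_path = os.path.join(tmp, "empty.fastq.gz")
    one_path = os.path.join(tmp, "one.fastq.gz")
    with gzip.open(empty_path, "wt"):
        pass  # write nothing: a valid gzip file containing zero reads
    with gzip.open(one_path, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n")
    empty_count = count_fastq_records(empty_path)
    one_count = count_fastq_records(one_path)

print(empty_count, one_count)  # 0 1
```

A count of zero at this stage points to the read lengths or the merging step rather than to the later table-processing code.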

nli8888 commented 2 years ago

Ah ok I understand. Sorry I thought you meant something else.

Ok so we have gotten SE mode running and we'll see how that goes. If we need to run PE mode, we may try to manually trim the missing middle section in our design file. For future seqs, we'll just make them shorter to ensure full coverage.

Alternatively, we may try MPRAsnakeflow if it is indeed possible to use PE mapping without merging?

But we'll see which way provides the best results for our purposes.

visze commented 2 years ago

Right now MPRAsnakeflow merges reads, too. If more people generate such data, I think it is worth implementing PE mapping without merging. And I am always happy if someone implements it themselves and makes a pull request to the repo.

visze commented 2 years ago

"Ok so we have gotten SE mode running and we'll see how that goes. If we need to run PE mode, we may try to manually trim the missing middle section in our design file. For future seqs, we'll just make them shorter to ensure full coverage."

Manually trimming the middle section of the design file will not help you, because the PE reads still do not overlap. Another option is merging the reads yourself, inserting Ns in between. After that you can run the single-end mode. Not 100% sure if it works, but maybe this is the best option!
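A rough sketch of what joining the reads with Ns could look like for a single read pair is given below. It is an untested idea, as noted above, not an MPRAflow feature. The function names are hypothetical, and it assumes that every insert has the same known length (270 bp in this thread), that both reads start exactly at the insert ends, and that the padded bases get the lowest Phred+33 quality character.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_with_ns(fwd_seq, fwd_qual, rev_seq, rev_qual, insert_len, pad_qual="!"):
    """Join a non-overlapping read pair into one read, padding the gap with Ns.

    Hypothetical helper: the reverse read is reverse-complemented so the
    result is on the same strand as the forward read and the design.
    """
    gap = insert_len - len(fwd_seq) - len(rev_seq)
    if gap < 0:
        raise ValueError("reads overlap; use a real read merger instead")
    seq = fwd_seq + "N" * gap + reverse_complement(rev_seq)
    qual = fwd_qual + pad_qual * gap + rev_qual[::-1]
    return seq, qual


# Toy pair on a 10 bp insert; with the data in this thread the gap would be
# 270 - 132 - 121 = 17 Ns.
seq, qual = join_with_ns("ACGT", "IIII", "TTGG", "FFFF", insert_len=10)
print(seq)   # ACGTNNCCAA
print(qual)  # IIII!!FFFF
```

The joined reads would then be written to a new FASTQ file and passed to the single-end mode, with the CIGAR setting adjusted to the full insert length if the aligner accepts the Ns.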

nli8888 commented 2 years ago

Ok thanks. We have some outputs now, though there are some other issues with them. I'll close this issue and open another to keep things organised.