wandreopoulos / deeplasmid


The question about removing sequences with 90% identity and 90% length coverage from the IMG test dataset #15

Open ZoeLct opened 4 months ago

ZoeLct commented 4 months ago

In your article, you mention: 'For testing on independent data, we removed from the IMG test dataset any sequence that had 90% identity along 90% of length coverage with any sequence in the training dataset, leaving us with 3280 sequences.' I would like to know specifically how you did this. What bioinformatics software or related scripts did you use? Looking forward to your reply.

wandreopoulos commented 4 months ago

Hello, for this we used BLAST. You can set the BLAST parameters to 90% identity and 90% query length coverage to find whether any sequence closely matches another sequence in the database.


Thanks,
Bill

William B. Andreopoulos, Ph.D.
Joint Genome Institute, LBNL
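
For readers who want to try the approach described in the reply, here is a minimal sketch of one way it could be done. It is not the authors' actual script. It assumes the test set was searched against the training set with BLAST+ `blastn` and that the hits were written in tabular format with the columns `qseqid sseqid pident qcovhsp`. The file names in the comments, the function names, and the choice of per-HSP query coverage (`qcovhsp`) as the measure of "length coverage" are all assumptions made for illustration; the paper does not say which coverage measure was used.

```python
# Assumed BLAST+ commands (file names are placeholders, not from the paper):
#   makeblastdb -in train.fasta -dbtype nucl -out train_db
#   blastn -query test.fasta -db train_db \
#          -outfmt "6 qseqid sseqid pident qcovhsp" -out hits.tsv


def redundant_queries(blast_rows, min_identity=90.0, min_coverage=90.0):
    """Return IDs of query sequences with at least one hit meeting both thresholds."""
    redundant = set()
    for row in blast_rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, qcovhsp = row.split("\t")[:4]
        if float(pident) >= min_identity and float(qcovhsp) >= min_coverage:
            redundant.add(qseqid)
    return redundant


def filter_fasta(fasta_lines, drop_ids):
    """Yield FASTA lines, skipping every record whose ID is in drop_ids."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            seq_id = header[0] if header else ""
            keep = seq_id not in drop_ids
        if keep:
            yield line


if __name__ == "__main__":
    # Toy data standing in for hits.tsv and test.fasta (hypothetical IDs).
    hits = [
        "test1\ttrainA\t95.2\t98",   # passes both thresholds, so it is removed
        "test2\ttrainB\t99.0\t40",   # high identity but low coverage, so it is kept
        "test3\ttrainC\t85.0\t100",  # full coverage but low identity, so it is kept
    ]
    fasta = [">test1 desc", "ACGT", ">test2", "GGCC", ">test3", "TTAA"]
    drop = redundant_queries(hits)
    print(sorted(drop))
    print(list(filter_fasta(fasta, drop)))
```

With real data, the lines of `hits.tsv` and `test.fasta` would be passed in place of the toy lists. Filtering in Python rather than with `blastn` threshold flags keeps all hits on disk, so the cutoffs can be changed later without rerunning the search.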